Evaluation of research in biomedical ontologies.

Robert Hoehndorf, Michel Dumontier, Georgios V Gkoutos.

Abstract

Ontologies are now pervasive in biomedicine, where they serve as a means to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. For this purpose, research on biomedical ontologies applies theories and methods from diverse disciplines such as information management, knowledge representation, cognitive science, linguistics and philosophy. Depending on the desired applications in which ontologies are being applied, the evaluation of research in biomedical ontologies must follow different strategies. Here, we provide a classification of research problems in which ontologies are being applied, focusing on the use of ontologies in basic and translational research, and we demonstrate how research results in biomedical ontologies can be evaluated. The evaluation strategies depend on the desired application and measure the success of using an ontology for a particular biomedical problem. For many applications, the success can be quantified, thereby facilitating the objective evaluation and comparison of research in biomedical ontology. The objective, quantifiable comparison of research results based on scientific applications opens up the possibility for systematically improving the utility of ontologies in biomedical research.

Keywords: biomedical ontology; evaluation criteria; ontology evaluation; ontology-based applications; quantitative biology

Year: 2012 PMID: 22962340 PMCID: PMC3888109 DOI: 10.1093/bib/bbs053

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

INTRODUCTION

Biomedical ontology is an emerging discipline that applies theories and methods from diverse disciplines such as philosophy, cognitive science, linguistics and formal logics to perform or improve biomedical applications. As a scientific discipline, it requires a research methodology that yields reproducible and comparable results that can be evaluated independently. Methodological progress in biomedical ontology will be recognized when different methods generate results that can be objectively compared, such that it becomes possible to evaluate whether the methods yield better results. There is considerable debate about establishing metrics for evaluating research results in applied ontology as well as determining the perspective from which its results should be evaluated [1-3]. Many evaluation strategies are based on criteria stemming from philosophy, knowledge representation, formal logics or ‘common sense’, while an empirical, repeatable and reproducible evaluation based on the domain of application is challenging to perform [4, 5]. The absence of commonly agreed criteria for evaluating research results in biomedical ontology leads to challenges in the development of an effective research methodology for the field of biomedical ontology: before a research methodology in any scientific field can be established, it is first necessary to determine what constitutes a research result, what constitutes a ‘novel’ research result (i.e. what does it mean that two research results are different) and what constitutes a better result than another (i.e. how can two competing results be compared and evaluated). Only after these questions are answered will it be possible to design a research methodology in a scientific field that enables the field as a whole to make progress with respect to the evaluation criteria that the discipline has established. Here, we review fundamental questions pertaining to research in biomedical ontologies.
We will focus on the application of ontologies in basic and translational research and will not discuss the large field of applying ontologies in health care and medicine, which is discussed elsewhere [6-9]. First, we review major applications of ontologies in biomedical research. From the perspective of an ontology user, we then discuss the problem of the ‘research question’ of biomedical ontology, i.e. what is the ‘scientific’ problem that research in biomedical ontology addresses. Third, we characterize and classify different types of research results in biomedical ontology, and finally, we discuss in depth different ways for evaluating and comparing research results in biomedical ontology. Although we will primarily focus on ontologies as they are used in biomedicine, we believe that many of our arguments will hold for research in other areas of applied ontology as well.

USES OF ONTOLOGIES IN BIOMEDICAL INVESTIGATIONS

Biomedical applications of ontologies

At the end of the 1990s and early 2000s, genetics made a leap forward with the availability of the first genome sequences for several species [10]. The availability of genome sequences for multiple species enabled comparative genomic analyses and revealed that a large part of the genetic material in different species was conserved and that many of the genes in different organisms have similar functions. The Gene Ontology (GO) [11] was designed as a controlled vocabulary to provide stable names, textual definitions and identifiers to unify descriptions of functions, processes and cellular components across databases in biology. Today, with the rise of high-throughput sequencing technology, genome sequences for thousands of species are becoming available, and large international research projects, such as the 5000 genomes project (which aims to sequence the genomes of 5000 insects and other arthropods) [12] or the Genomes 10 k project (which aims to sequence the genomes of 10 000 vertebrate species) [13], will collect even more data in the near future. High-throughput technologies are not limited to genome sequencing, but have influenced other areas in biology as well, from high-throughput phenotyping (to determine the observable characteristics of organisms, often resulting from targeted mutations) [14, 15] through microarray experiments (to determine gene expression) [16] to high-throughput screening in drug discovery [17, 18]. The amount of data produced in biology today makes the design of strategies for integration of data across databases, methods for retrieving the data and developing query languages and interfaces a central and important part of research in biology. The prime purpose of ontologies, such as the GO, is to address these arising challenges in biology and biomedicine and provide a means to integrate data across multiple heterogeneous databases.
To facilitate the integration of databases, retrieval of data and the provision of query languages, ontologies provided not only terms and textual definitions but also a basic structure. Initially, this structure was not expressed in a formal logic-based language. Instead, ontologies were seen as graph structures in which nodes represent terms and edges relations (such as ‘is-a’ or ‘part-of’) between them. Reasoning over these graphs was stated as operations on the graph, in particular the composition of edges and the transitive closure [11]. It was not until much later that formal languages were used to represent biomedical ontologies and recast the graph operation in terms of deductive inference over formal theories [19-22]. The graph structure of biomedical ontologies is not only a valuable feature to improve retrieval and querying but is also useful for other tasks, for example for Gene Set Enrichment Analysis (GSEA) [23] to analyse gene expression. GSEA utilizes the graph structure of the GO to determine whether a defined set of genes shows statistically significant, concordant differences between two biological states; it utilizes the annotation of sets of genes with GO terms and the GO graph structure and inference rules to statistically test for enriched GO terms. A large number of tools were developed to perform such enrichment analyses that have led to discoveries of cancer mechanisms [23], evolutionary differences in primates [24], genes involved in particular functions, such as oxidative phosphorylation [25] and many more. GSEA is now a standard tool in many biological analyses, as evidenced by more than 3200 citations (based on Google Scholar, 5 April 2012) for the original paper. Similar enrichment analyses are now being performed using ontologies of other domains, such as the Human Disease Ontology [26]. The graph structure of ontologies is also widely utilized for semantic similarity analyses [27].
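The two graph-based ideas described above, transitive closure over ‘is-a’/‘part-of’ edges and a statistical test for enriched terms, can be illustrated with a minimal sketch. The toy ontology, the gene identifiers and the function names below are hypothetical and chosen only for illustration. The test shown is a simple one-sided hypergeometric over-representation test, which is one common form of term enrichment and is not the rank-based statistic of the original GSEA method.

```python
from math import comb

# Hypothetical toy ontology: term -> set of parent terms (is-a / part-of edges).
PARENTS = {
    "oxidative phosphorylation": {"energy metabolism"},
    "glycolysis": {"energy metabolism"},
    "energy metabolism": {"metabolic process"},
    "metabolic process": set(),
}

# Hypothetical direct annotations: gene -> terms asserted for that gene.
ANNOTATIONS = {
    "g1": {"oxidative phosphorylation"},
    "g2": {"oxidative phosphorylation"},
    "g3": {"glycolysis"},
    "g4": {"metabolic process"},
    "g5": set(),
    "g6": set(),
}


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(annotations):
    """Close every gene's annotation set under the transitive closure."""
    return {
        gene: set().union(*(ancestors(t) for t in terms))
        for gene, terms in annotations.items()
    }


def enrichment_p_value(term, study_genes, closed):
    """One-sided hypergeometric P(X >= k) for a term in a study gene set."""
    population = len(closed)                                    # N
    annotated = sum(1 for ts in closed.values() if term in ts)  # K
    study = len(study_genes)                                    # n
    hits = sum(1 for g in study_genes if term in closed[g])     # k
    tail = sum(
        comb(annotated, i) * comb(population - annotated, study - i)
        for i in range(hits, min(annotated, study) + 1)
    )
    return tail / comb(population, study)


closed = propagate(ANNOTATIONS)
print(enrichment_p_value("energy metabolism", {"g1", "g2", "g3"}, closed))
```

In this sketch, none of the genes is directly annotated with ‘energy metabolism’; the term only becomes testable after annotations are propagated along the graph, which is the role the ontology structure plays in such analyses.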
Semantic similarity measures apply a metric on an ontology in order to compare the similarity between data that are annotated with classes in the ontology [28-30]. Some metrics are based on the distance between two nodes in the ontologies’ graph structure, while others compare sets of classes that are closed with respect to relations in the ontology [31-33]. In some cases, the metrics include further information, such as the information content that a class in an ontology has within a given domain. Importantly, however, all semantic similarity measures rely on the number and the kind of distinctions that the ontology developers have made explicit. Another application of ontologies is in text mining and literature search and retrieval [34, 35]. The availability of a common terminology throughout biology enables the task of named entity recognition, i.e. the identification of standardized terms in natural language texts [36, 37]. When terms from ontologies can reliably be detected in natural language texts, ontologies can be used for retrieving text documents from literature archives such as PubMed [38]. This task is made easier when terms in ontologies are widely used, and several biomedical ontologies have been evaluated based on how well their terms can be recognized in scientific literature [39]. Furthermore, identification of ontology term labels in text can be combined with analyses over the structure of ontologies (including similarity-based analyses and enrichment analyses) to improve text-mining results based on the ontology hierarchy. Ontologies are also used as knowledge bases (or structured databases) which are primarily intended to store and expose information about a domain. Ontologies of this type are comparable to scientific databases, such as UniProt [40], in that they contain information for scientists that can be accessed on demand. 
Examples of this type of ontology include the various anatomy ontologies [41-47] and pathway knowledge bases such as EcoCyc and MetaCyc [48]. These ontologies can go into great detail; an ontology like the Foundational Model of Anatomy (FMA) [43] is likely the most comprehensive formal description of human anatomy and exceeds the information and the detail contained in most individual anatomy textbooks.
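The semantic similarity measures discussed above can be made concrete with a small sketch. It shows one set-based measure (Jaccard overlap of ancestor sets that are closed with respect to the ontology's relations) and one information-content-based measure in the style of Resnik. The toy ontology, the annotation data and all function names are assumptions introduced for illustration only.

```python
from math import log

# Hypothetical toy ontology: term -> set of parent terms.
PARENTS = {
    "oxidative phosphorylation": {"energy metabolism"},
    "glycolysis": {"energy metabolism"},
    "energy metabolism": {"metabolic process"},
    "metabolic process": set(),
}

# Hypothetical annotations used to estimate how informative each term is.
ANNOTATIONS = {
    "g1": {"oxidative phosphorylation"},
    "g2": {"oxidative phosphorylation"},
    "g3": {"glycolysis"},
    "g4": {"metabolic process"},
}


def ancestors(term):
    """Return the term plus all terms reachable through parent edges."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def jaccard_similarity(term_a, term_b):
    """Set-based similarity: overlap of the two closed ancestor sets."""
    a, b = ancestors(term_a), ancestors(term_b)
    return len(a & b) / len(a | b)


def information_content(term):
    """IC(t) = -log p(t), with p(t) the fraction of genes annotated with t
    after propagating annotations to all ancestors."""
    annotated = sum(
        1 for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return -log(annotated / len(ANNOTATIONS))


def resnik_similarity(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


print(jaccard_similarity("oxidative phosphorylation", "glycolysis"))  # 0.5
print(resnik_similarity("oxidative phosphorylation", "glycolysis"))
```

Both measures depend entirely on the distinctions the ontology makes explicit: adding or removing an intermediate class such as ‘energy metabolism’ changes the ancestor sets and therefore the similarity values.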

Ontologies as formalized theories of a domain

Although the applications of biomedical ontologies we discussed so far do not rely on formalized semantics, axioms, the use of knowledge representation languages, automated reasoning or philosophical foundations, the past years have seen a rapid increase in applying formal methods to biomedical ontologies. In particular, the Web Ontology Language [49] is now widely used to represent biomedical ontologies [19]. In some cases, more expressive languages such as first- and monadic second-order logic are used to specify ontologies, in particular for biological sequences [50, 51] and molecular structures and graphs [52]. Using these languages, knowledge about a domain is expressed following the axiomatic method [53], based on which axioms (i.e. statements that are considered to be true about the domain) are asserted and the consequences of these axioms are inferred using inference rules [54]. Automated reasoning is the process by which the inferences are deduced automatically. The stated aims of applying philosophical foundations, the axiomatic method, knowledge representation languages and automated reasoning for biomedical ontologies are manifold and include the search for philosophical rigour and a foundation in particular philosophical theories [3, 5, 55], providing expressive and machine-readable documentation of the meaning of terms in a vocabulary [51], verifying the consistency of a data model [56, 57], verifying the consistency of data with respect to a data model [56, 58], enabling complex retrieval and querying through automated reasoning [59], integrating multiple ontologies [60, 61] and decreasing the cost of developing and maintaining an ontology [62-65]. Furthermore, the application of formal methods in biomedical ontologies has the potential to reveal mistakes in the design of ontologies [5, 21, 66, 67] or to improve their utility for scientific analyses [61, 68]. 
Several projects have started to axiomatize biomedical ontologies [57, 61, 69, 70], and these projects have led to changes in the ontologies and the detection and removal of contradictory statements [57, 60]. Other researchers have suggested changes to improve ontologies’ structures and axioms based on applying formal, ontological and philosophical methods [21, 55, 66, 67, 71, 72], or they provide ontological interpretations of domain-specific knowledge by applying a formal ontological theory to some phenomena in a domain [67, 73–75]. Table 1 provides a list of use cases and examples for the application of formal methods in biomedical ontologies.

Table 1:

Formal approaches to ontology research and their potential impact on biomedical applications and analyses

Task	Description	Potential impact on biomedical applications and analyses	Example
Philosophical foundations	A theory from philosophy is applied either to biomedical ontologies or to biomedical domains. It is then demonstrated that the philosophical theory can explain the distinctions within the domain. Furthermore, philosophical foundation theory can provide insights into the principles based on which scientists within a domain distinguish different kinds of entities and can provide a methodology for classifying domain entities. Formalizing aspects of the philosophical principles can enable verification of a domain theory with regard to these principles.	Increased coherence in representing knowledge and data, comprehensibility and interoperability	A study demonstrated that a particular perspective on philosophical realism can be used to describe chemical structures, even when the type of structure is known to be impossible to exist [55].
Provision of unambiguous, formal documentation	The use of formal languages can remove ambiguity from specifications (of a domain, the meaning of a term, etc.). Based on formal logics, consequences of a specification can then be determined by a mathematical proof, thereby avoiding potential misunderstandings based on natural language.	Increased coherence, increased clarity	The RNA ontology (RNAO) [51] is a biomedical ontology used to describe RNA structure. The core of RNAO is formalized in first- and second-order logic. These rich formalisms are used to precisely formalize basic notions, such as the meaning of ‘molecule’ within the context of the RNAO.
Provision of machine-readable documentation	Some aspects of the meaning of terms are formalized using a knowledge representation language so that automated systems can gain access to the meaning and process it.	Automated data processing, automated knowledge- and data-integration, semantic integration	The GO [11], as well as a large number of ontologies in the OBO Foundry [1], use the OBO Flatfile Format [19] to make ‘some aspects’ of term meanings explicit. For example, the GO contains a taxonomy as well as relations about parthood and regulation. Another project is aimed at providing richer formalized definitions for the GO [57] so that more information about the terms’ meaning can be accessed automatically.
Consistency verification	Statements that are considered true in the domain (axioms) and term definitions are formalized and an automated reasoner is used to verify the consistency (i.e. the absence of contradictions) of the stated knowledge. Furthermore, ‘satisfiability’ of a class can be automatically verified (a class is satisfiable if it is possible for the class to have instances). Once a model of a domain is consistently formalized, it can be applied to verify data in this domain. For this purpose, an automated reasoner verifies whether data items satisfy the constraints expressed in the model of the domain. Often, expressive automated reasoners such as OWL 2.0 reasoners are used to perform consistency verification.	Increased coherence, detection of modelling errors, detection of competing scientific theories, data coherence	Inconsistencies when combining anatomy and phenotype ontologies were detected [72] and resolved by explicitly distinguishing between normal and abnormal anatomy.
Data classification	An ontology of a domain is applied to classify data in a domain. In this task, an automated reasoner uses the constraints in the domain ontology to automatically assign data items into ontology categories.	Classification, data analysis	A study [76] formalized human knowledge about the classification of protein phosphatases in an OWL ontology and applied automated reasoning to automatically assign classes for human and Aspergillus fumigatus proteins. An evaluation showed that ontology-based classification matches, and sometimes exceeds, human judgment.
Supporting ontology development	Formal representations and automated reasoning can support ontology development by inferring information that is not explicitly stated. Possibly undesired consequences can be examined either manually or automatically and the statements leading to the undesired consequence can be corrected. Particularly useful is the automated construction of taxonomies based on axioms in an ontology. Complex statements and definitions are automatically transformed into a generalization hierarchy (a taxonomy) by the automated reasoner.	Decreased maintenance, detection of errors	The GULO software [77] uses automated reasoning over the axioms in an ontology to improve the taxonomic structure of an ontology. Furthermore, it enables ontology developers to validate the accuracy of their definitions.
Support querying	Based on the axioms about a domain, automated reasoners can infer a potentially infinite number of statements that are true if the axioms are true. Therefore, formal logics are ideally suited to encode knowledge about a domain so that it can support a wide variety of queries. Automated reasoners are capable of automatically determining the answers to the queries using the statements in the formalized theory. This is one of the most widely used applications of automated reasoning in ontologies. To efficiently support querying in applications that require quick response times, highly optimized reasoners and low expressivity of the knowledge representation language are beneficial [74].	Support knowledge extraction, connect databases and domains	A web-based query tool used the ELK reasoner [78] to query the GO and the mouse phenotype ontology as well as their annotated data [79]. Another example is the FlyBase model organism database [80] which uses the Pellet reasoner [81] to perform data queries.
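The notion of class satisfiability used in the ‘Consistency verification’ row of Table 1 can be illustrated with a deliberately small sketch. It handles only two kinds of axioms, subclass assertions and pairwise disjointness, and flags a class as unsatisfiable when it inherits from two classes declared disjoint. The class names and axioms are hypothetical, and the check covers only a tiny fragment of what OWL reasoners such as those named in the table support.

```python
# Hypothetical axioms: class -> set of asserted superclasses.
SUBCLASS_OF = {
    "enlarged heart": {"heart", "abnormal structure"},
    "heart": {"normal structure"},  # the modelling choice that causes the clash
    "normal structure": {"anatomical structure"},
    "abnormal structure": {"anatomical structure"},
    "anatomical structure": set(),
}

# Hypothetical disjointness axioms: no instance may belong to both classes.
DISJOINT = [("normal structure", "abnormal structure")]


def superclasses(cls, subclass_of):
    """Return the class plus all superclasses in the transitive closure."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def unsatisfiable_classes(subclass_of, disjoint):
    """List classes that inherit from two classes declared disjoint."""
    result = []
    for cls in subclass_of:
        supers = superclasses(cls, subclass_of)
        if any(a in supers and b in supers for a, b in disjoint):
            result.append(cls)
    return sorted(result)


print(unsatisfiable_classes(SUBCLASS_OF, DISJOINT))  # ['enlarged heart']

# One possible repair: stop asserting that every heart is a normal structure.
repaired = dict(SUBCLASS_OF, heart={"anatomical structure"})
print(unsatisfiable_classes(repaired, DISJOINT))  # []
```

The number of unsatisfiable classes before and after a change gives a simple quantity that can be reported when axioms are added to or revised in an ontology.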

THE RESEARCH QUESTIONS OF BIOMEDICAL ONTOLOGY

The examples we discussed include the current major applications of ontologies in biomedical research, and additional ontology-based applications are developed frequently and range from novel scientific data analysis methods over the design of user interfaces to semantic publishing of scientific articles. One underlying commonality in the ontology-based applications reviewed here is that ontologies determine or guide the ‘way’ in which domain content is expressed. Research in ontology answers the questions of ‘how’ a proposed standard terminology should be built so that it satisfies the needs of multiple users in the domain, ‘how’ domain content must be expressed so that relevant retrieval operations or particular scientific analyses are supported and ‘how’ information must be formalized so that data and model consistency can be verified with regard to specific constraints. In most cases, there are multiple possibilities for structuring information within a domain and not all perform equally well. Additionally, it may be possible to identify common underlying principles of ‘how’ to structure information within a domain in order to serve particular applications. While these principles may originate from diverse disciplines, including philosophy, linguistics and cognitive science, it is their effect on biomedical applications that makes them either successful or unsuccessful choices. In this sense, the research area of ‘ontology’ is the bridge between theories originating from these diverse disciplines and the domain of application; ontology is about selecting the right way of modelling a domain for a particular application. Following this understanding of ‘ontology’, we can distinguish between several different types of research results. First, ontologies themselves are research results in biomedical ontology. An ontology is an artefact that specifies a particular set of categories that are useful and applicable for certain tasks within a domain. 
More than 300 ontologies in the biomedical domain are listed in the BioPortal [82] alone, and their intended applications are highly diverse, covering all the use cases we discussed so far and more. A second type of research result is an ‘ontology design pattern’, i.e. a ‘way’ to represent information so that it can be applied for a specific purpose [83]. Many of these patterns are currently implemented in domain ontologies, or arose from best practices in building ontologies. Most notably, relations in biomedical ontologies were a controversial topic for several years [66] until a set of ontology patterns was proposed that standardized the meaning of a large number of relations used in biomedical ontologies [21]. Similarly, in the biomedical domain, patterns have been proposed for expressing information about qualities [84], functions [85], dispositions [73], phenotypes [86, 87] and realizable entities [67]. Often, these patterns are motivated by theories taken from other scientific fields and applied to the field of biomedicine. For example, well-developed ontological theories of functions are available in philosophy [88], biology [89] and linguistics [90] and can be applied to formulate biological knowledge. Since not all of them will perform equally for all tasks, the evaluation of an ontology design pattern requires the application of the design pattern to a particular ontology and a measure of its impact in an ontology-based application. We consider the ‘application of a design pattern to an ontology’ a third type of research result. Applying a design pattern often involves changing an ontology so that certain information is structured according to the design pattern. This can either be done at a single place in the ontology, in order to demonstrate the consequences of applying the design pattern, or throughout an ontology. In the first case, consequences can be measured on a single example and their effect on the whole ontology could be hypothesized.
Only the second case will enable the direct evaluation of all consequences. Finally, a fourth type of research result is a methodological result. Methodological advances in applied ontology may abstract from specific applications of ontologies and identify generic approaches that will lead to reproducible positive outcomes in certain scenarios. These approaches can eventually lead to guidelines for ontology quality with respect to certain applications. For example, the OntoClean approach [91] is such a general method for building ontologies that are robust (i.e. re-usable across multiple applications) and comprehensible.

EVALUATION OF RESEARCH RESULTS IN BIOMEDICAL ONTOLOGY

Despite the large number of research projects that apply formal ontological theories to scientific domains, no common evaluation criteria are being applied in these studies. Similarly, the stated goals of such research are highly diverse and sometimes the impact of the research on scientific applications is not demonstrated or even discussed [2]. Examples of evaluation criteria for research in applied ontology include formal consistency of the developed theory [92], the identification of unsatisfiable classes [57, 60], conformance to a particular philosophical theory [3, 55, 93], user acceptance [94], conformance to naming conventions [95] or the recall of ontology class labels in scientific literature [39]. Only a few of these criteria actually evaluate ‘what ontologies do’, while the majority of these criteria evaluate the research results based on philosophical, formal and technical criteria that lie within the domain of ontology or its underlying technologies themselves. The selection and application of evaluation criteria provides the means to distinguish research in ‘applied’ ontology from research in ‘non-applied’ ontology. In ‘applied’ ontology, ontologies are being used for some task within a domain, and that task usually lies outside of the domain of ontology itself. (A notable exception to this is when we apply ontological methods to the domain of ontology itself to classify different kinds of ontologies and their parts, or to analyse the types of relations between classes, relations, instances and individuals. Such an ontology could, for example, be used to provide the conceptual foundation of an ontology editor, to enable interoperability between different ontology learning algorithms, in portals providing access to different ontologies, or in an ontology evaluation framework.)
Consequently, evaluation criteria for research results in ‘applied’ biomedical ontology will be derived based on the task to which the result is being applied, and not from the domain of ontology itself. The search for philosophical foundation and rigour, including the demonstration that a particular philosophical theory is capable of expressing distinctions that are being made within a domain, are examples of research goals of non-applied ontology, because the aims of the research and its evaluation will generally lie within the realm of ontology, not within the domain of application. Applying a particular philosophical theory can, in many cases, improve the utility of an ontology, and demonstrating that the application of a particular philosophical perspective improves the utility of an ontology for some task in a domain would constitute a result in applied biomedical ontology. We can also distinguish between ‘who’ or ‘what’ directly benefits from a particular result of research in ontology: either the users and uses of an ontology, ontology-based applications and specific tasks to which ontologies are being applied, or the developers and maintainers of an ontology. Developers and maintainers of ontologies will benefit directly from decreased maintenance work, ease of construction and the availability of technical documentation, while users and applications of an ontology will only benefit indirectly from such research goals. Users and applications of ontologies benefit from the community agreement which ontologies can bring about and their resulting potential for ontology-based data annotation and integration, retrieval and querying, novel scientific analyses and in some cases consistency verification of data. Since users of ontologies will benefit from something that ontologies can ‘do’, research in ‘applied’ ontology has to be measured based on how well ontologies ‘do’ their tasks.
One of the most widely cited applications of ontologies in science is their potential to facilitate community agreement of the meaning of terms in a domain. These terms are frequently used as metadata in scientific databases and publications. Consequently, applying ontologies to standardize the vocabulary used as metadata can enable the integration and interoperability of databases and research results. There are several possibilities for evaluating an ontology that is intended to effectively standardize the meaning of terms in a vocabulary and support interoperability and integration. Since the prime aim of such a research result is to achieve community agreement, an obvious evaluation criterion would be to conduct a user study that evaluates whether different users can consistently apply terms within a standardized task such as the annotation of a data set with classes from an ontology. For this task, Kappa statistics can be applied and a κ value can be reported that measures the degree to which annotators agree [96, 97]. Kappa statistics is widely applied in computational linguistics [98] and biomedical text mining [99], for the verification and disambiguation of biomedical resources [100] and to evaluate some consequences of biomedical ontologies [94]. The support of queries and retrieval of data is another task for which ontologies and their axioms are built. Information retrieval is a discipline in computer science for which rigorous quantitative evaluation criteria are available [101], often based on the comparison to a gold standard or a set of positive and negative examples based on which statistical measures can be applied. Quantitative measures include the F-measure (the harmonic mean between precision and recall) or the area-under-curve (AUC) in an analysis of the receiver operating characteristic (ROC) curve [102].
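As an illustration of the κ value mentioned above, the following minimal sketch computes Cohen's kappa for two annotators who each assign one ontology class to the same list of items. Cohen's kappa is only one member of the family of Kappa statistics, and the annotator data and term labels are invented for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance given each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four data items with two ontology classes.
annotator_1 = ["term_A", "term_A", "term_B", "term_B"]
annotator_2 = ["term_A", "term_B", "term_B", "term_B"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on three of four items (75%), but half of that agreement is expected by chance, so κ is 0.5. This is why κ is usually reported alongside, or instead of, raw percentage agreement.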
If an ontology, or axioms in an ontology, are intended for retrieval, measures that compare the inferences to a gold standard can be applied to demonstrate the success of the ontology. In many cases, axioms in ontologies are added in order to enable novel queries that make distinctions which could not be made before. For example, adding axioms about parthood to a purely taxonomic representation of anatomical structures enables new kinds of queries based on the use of parthood relations. Such a result (the addition of new axioms to enable novel types of queries and retrieval operations) can be evaluated using the same quantitative measures as ontology-based retrieval. All of these descriptions assume that there is already some data which is being retrieved using queries over the ontology. In the absence of such data, e.g. when a new ontology is proposed within a domain with the intent to use this ontology to annotate data in the future, data could be simulated and then used in the evaluation. Further applications of formalized ontologies include the verification of data with respect to certain constraints that are expressed within the ontology. For example, in the domain of biological pathways, the BioPax ontology [56] has been proposed, and one of its aims is to verify pathway data with respect to the model that the BioPax ontology provides. Similarly, a recent study used formal ontological analysis and automated reasoning to investigate the consistency of a database of computational models and identified a large number of incorrectly characterized database entries [58]. A quantitative measure of success would then be the number of inconsistencies that were identified in a data set.
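The retrieval measures named above (precision, recall, the F-measure and the area under the ROC curve) can be computed against a gold standard as in the following sketch. The item identifiers, scores and gold standard are hypothetical. The AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive item is scored above a randomly chosen negative one, with ties counted as one half.

```python
def precision_recall_f1(retrieved, gold):
    """Precision, recall and F-measure of a retrieved set against a gold standard."""
    retrieved, gold = set(retrieved), set(gold)
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def roc_auc(scores, gold):
    """Area under the ROC curve for scored items (item -> score)."""
    positives = [s for item, s in scores.items() if item in gold]
    negatives = [s for item, s in scores.items() if item not in gold]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical query result and gold standard.
gold_standard = {"a", "b", "d"}
print(precision_recall_f1({"a", "b", "c"}, gold_standard))

# Hypothetical ranking scores, e.g. from a similarity-based query.
ranking = {"a": 0.9, "b": 0.4, "c": 0.6, "d": 0.8, "e": 0.1}
print(roc_auc(ranking, gold_standard))
```

Running the same evaluation before and after axioms are added to an ontology yields two comparable numbers, which is the kind of objective comparison the text argues for.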
Applications of ontology research in scientific analyses and in the process of making novel scientific discoveries are perhaps the best-evaluated contributions in applied ontology, since the contributions that ontology research can make in these areas are commonly subject to the same evaluation criteria as other contributions in the scientific domain of application. For example, the GSEA method was evaluated both using statistical measures and experimentally verified data that has been extensively studied [23, 25], and the use of semantic similarity measures to identify interacting proteins based on GO is rigorously evaluated and compared using ROC and correlation coefficient analysis [103]. In each case, the scientific domain to which ontology-based methods are being applied has established, and often demands, quantitative evaluation criteria that can ensure the objective and empirical evaluation and comparison of research results. Furthermore, an integrated scientific analysis of the data in multiple databases between which interoperability is intended to be achieved can be performed and evaluated on a scientific use case. For example, the development of formal definitions for phenotype ontologies [86] can be quantitatively evaluated by using these definitions to integrate multiple model organism databases and analyse the integrated knowledge with regard to its potential for revealing novel candidate genes for diseases [68]. There are several other tasks that may fall in the domain of applied ontology research. For example, formal ontological analysis can be applied to specify a conceptual model, verify its consistency and identify modelling choices that potentially lead to faulty results; or formal ontology can be applied to formally specify the meaning of terms in a vocabulary (e.g. to enable communication between autonomous intelligent agents).
Some of these tasks can also be evaluated quantitatively: while consistency of a conceptual model is a binary quality that relies on a consistency proof, incorrect consequences can be estimated using predefined tests that aim to make inferences of a certain kind [104]. A formal specification of the meaning of a term using an ontology can be accompanied by a meta-theoretical analysis and a completeness proof for the ontology [105]. Depending on the application to which ontology-based research is applied, we can derive quality criteria, some of which are illustrated in Table 2. The heterogeneity of ontology-based applications prevents the application of a single quality and evaluation criterion. Instead, research results in biomedical ontology must be evaluated in conjunction with a task to which this result is being applied. For example, instead of evaluating the quality of an ontology O that represents biological pathways, we have to evaluate O with respect to different tasks that it is intended to perform. For instance, O may be used to achieve community agreement about the terms used to annotate pathway databases (a first task), and we can evaluate O with respect to that task. On the other hand, O may also be used to verify the consistency of biological pathway data (a second task), and we can evaluate O with respect to that task as well. A consequence could be that O achieves one task very well while its performance in a second task is poor.

Table 2:

Opportunities for the quantitative evaluation of research results in applied ontology

Application	Possible evaluation methods	Description	Quantifiable result	Example
Establish ‘community agreement’ about meanings of terms in a vocabulary. In a domain in which terms can have different meanings based on the background of a researcher, an ontology is developed to provide a reference for ‘particular’ meanings of terms.	User-study	Multiple people perform a task, such as determining the occurrence of terms from an ontology in a manuscript, independently. The goal is to achieve a high agreement between annotators about the ontology terms that have occurred in the manuscript.	Percentage agreement, κ statistics	A study was performed to evaluate the agreement between expert curators of the GO and found ‘that there is 39% chance of curators exactly interpreting the text and selecting the same GO term, a 43% chance that they will extract a term from new/different lineage and a 19% chance that they will annotate a term from the same GO lineage’ [106].
‘Annotate data consistently’ across multiple databases, user communities or domains.	User-study	Multiple people annotate the same data set using an ontology. The goal is to achieve a high agreement in the resulting annotations.	Percentage agreement, κ statistics	A study was performed to evaluate GO annotation consistency between human and mouse. The authors find that, out of a set of 3359 annotations, 2137 are matches and 1222 are mismatches (and potential annotation inconsistencies) [107].
‘Integrate multiple databases’ and provide a uniform view across.	User study, integrated analysis	An evaluation can perform an analysis of an integrated data set, or compare the integration results to a gold standard. Integrated analysis results can either be compared to a reference or tested based on a scientific use case.	Integrated data analysis results, precision, recall, F-measure	The phenotype data contained in multiple model organism databases were integrated and utilized for the task of prioritization of candidate genes for a disease. The results were compared against gene–disease associations in the OMIM database (gold standard) and quantitatively evaluated using ROC analysis [68].
‘Answer queries over data’ using the ontology as the conceptual model of a database (i.e. the classes and relations in the ontology are used to structure the database and functions as a vocabulary based on which queries can be built).	Test suite, comparison to gold standard	An evaluation can be based on a test suite (in which particular queries and the desired results are specified) or a gold standard, and use the ontology to perform test queries over the database and determine if the results conform to the desired outcome or compare the results to the gold standard. Additionally, a performance analysis can be used to determine the time and space required to implement the queries.	Number of tests passed, precision, recall, F-measure; complexity class, performance measurements	A study implemented an RDF-based query system over biomedical ontologies together with several relation axioms, demonstrating several queries that could not be answered before. The evaluation found that ‘the answers to such a query are complete and they correspond to the logical meaning of the relation types as intended by the ontology engineers’ [108].
‘Answer questions’ over the knowledge contained in the ontology.	Test suite, content evaluation	Evaluation can take several directions. A test suite of questions can be designed, the ontology used to answer these questions, and the results compared to the outcome. Furthermore, the content of the ontology can be evaluated similarly to database content evaluation [109].	Number of tests passed, domain coverage (percentage), currency (number of times updated), expert evaluation	A study evaluated whether questions about existential restrictions in biomedical ontologies are correct as judged by experts in the field. The results show that, ‘[a]ccording to a rating done by four experts, 23% of all existential restrictions in OBO Foundry candidate ontologies are suspicious (Cohens’ κ = 0.78)’ [94].
Determine ‘consistency of data’ with respect to constraints in the ontology.	Test suite, performance measurement	A test suite of different types of data inconsistencies can be designed, and a performance evaluation used to measure the time and space complexity for identifying inconsistencies.	Number of tests passed (contradictions found); complexity class	A top-level ontology of computation models in systems biology (consisting of less than 10 classes and less than 10 relations) was formalized in OWL and the models in the BioModels database [110] were verified with regard to this ontology. As a consequence, the study detects several contradictions in the BioModels knowledge base, arising from annotations in 27 models [58].
Determine the ‘consistency and accuracy of the conceptual model’.	Automated reasoning, test suite	Automated reasoning can be used to determine model consistency, and a test suite can be used to test accuracy of consequences following from the model.	Number of tests passed, number of inconsistencies found	A project to formalize the definitions of GO terms [57] has detected 7397 unsatisfiable classes in GOs definitions, 3487 in MPs definitions and 1017 in HPOs definitions. For example, ‘system process’ and ‘cellular process’ were declared as disjoint classes, but ‘leukocyte activation’ was inferred to be both a subclass of ‘cellular process’ and of (immune) ‘system process’ [60].
Enable ‘novel scientific analyses’, such as Gene Set Enrichment Analysis (GSEA) or semantic similarity, that rely on the type and the number of distinctions made in an ontology to analyse a data set.	Case-specific scientific validation	Evaluation must be based on the specific scientific problem and the standard established for the particular scientific discipline. An example for an evaluation could be to perform an experiment.	Various quantifiable results, including p-value, F-measure, ROC AUC	The novel method GSEA was proposed, that utilizes the annotations and the structure of GO to interpret gene expression data [23]. Results of GSEA were compared to published results and experimentally validated.

‘Robustness’ can then be evaluated by assessing an ontology (or another research result in biomedical ontology) on multiple tasks: if the ontology performs well in multiple heterogeneous tasks, the ontology is ‘robust’. Additionally, it becomes possible to evaluate how much the quantitative results change under changing application conditions.
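The agreement measures listed in Table 2 (percentage agreement and κ statistics) can be computed directly from two annotators' labels. The sketch below is illustrative only: the function names and the toy annotations are assumptions, and Cohen's κ is used as one common choice of κ statistic. The only figures taken from the text are the 2137 matches out of 3359 annotations reported in [107].

```python
from collections import Counter


def percentage_agreement(labels_a, labels_b):
    """Fraction of items to which both annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = percentage_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six items by two curators.
curator_1 = ["apoptosis", "apoptosis", "transport", "transport", "apoptosis", "transport"]
curator_2 = ["apoptosis", "transport", "transport", "transport", "apoptosis", "transport"]

agreement = percentage_agreement(curator_1, curator_2)
kappa = cohens_kappa(curator_1, curator_2)

# Percentage agreement implied by the counts reported in [107].
reported_agreement = 2137 / 3359
print(agreement, kappa, reported_agreement)
```

Note that κ cannot be recovered from the match and mismatch counts alone, because it also depends on how often each annotator used each term; this is one reason to report both measures.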

ONTOLOGY PEER REVIEW

Several evaluation methods for research in applied ontology have been proposed, and multiple studies have attempted to evaluate the quality of ontologies in biomedicine. Currently, there is little emphasis on the need for objective, quantitative evaluation criteria for applied ontology research; on the contrary, many quality criteria are derived from philosophical and social considerations. In particular, several studies emphasize the need to treat ontologies similarly to scientific publications and propose an evaluation strategy similar to scientific peer review. For example, Obrst et al. [4] aim to identify ‘meaningful, theoretically grounded units of measure in [ontology]’ and perform an extensive review of previous ontology evaluation attempts, including a brief discussion of application-based evaluation approaches and quantifiable results. However, Obrst et al. dismiss application-based evaluation strategies since they are ‘expensive to carry out’, and instead propose ontology evaluation by humans based on principles derived from common sense, formal logics or philosophy (especially in the form of philosophical realism). A similar route is taken by Smith, who suggests that peer review of ontologies should become standard practice, since ‘[p]eer review provides an impetus to the improvement of scientific knowledge over time’ [1, 5]. The criteria for peer review of ontology-based research results proposed by Smith [5], Obrst et al. [4] and others [111] are largely derived from ‘common sense’ or philosophical positions and do not rely on an objective, empirical demonstration that the criteria improve the performance of ontologies in any biomedical application. Such a peer review system is intended to be adopted by the OBO Foundry ontology community [1, 5]. The OBO Foundry principles (accepted and proposed principles can be found at http://obofoundry.org/crit.shtml) form some of the most widely used criteria for ontology development in biology.
The majority of the OBO Foundry criteria are intrinsically social and highly valuable for enabling wide access to the content of the ontologies, serving scientific discourse about and investigations into the ontologies and their content. To evaluate ontologies based on social criteria, peer review is valuable. Some criteria could be further extended by asking for empirical, quantifiable evidence. For example, while the inclusion of textual definitions (criterion 6) and documentation (criterion 8) can improve comprehensibility of ontologies, comprehensibility will primarily depend on the quality of the textual definitions and documentation: not all definitions and not all documentation are equally well suited. User studies can be used to evaluate and quantify the quality of the definitions and even compare them against automated methods to generate textual definitions [112].

BUILDING AND EVALUATING ONTOLOGIES FOR INTEGRATIVE RESEARCH

The development of a systematic evaluation strategy grounded in real biomedical data will help to further improve the utility of ontologies for integrative biomedical research. To develop such a strategy, different approaches for evaluating ontologies can be combined. The direct evaluation of ontologies (see Figure 1), such as that facilitated by ontology peer review, ensures availability of the ontologies, compliance with good scientific practices and reporting standards, the use of standard formats in distributing ontologies, and other valuable criteria.

Figure 1:

A direct evaluation of an ontology can assess intrinsic properties of the ontology such as consistency, expressivity, or the inclusion of natural language definitions and labels. Furthermore, the evaluating person can examine definitions and axioms of the ontology and either agree or disagree with their content. However, a review of an ontology alone does not immediately evaluate the ontology’s suitability for particular applications and analyses. Therefore, ontology evaluation can be further substantiated by an application-based evaluation (see Figure 2). In such an evaluation, an ontology is not assessed directly, but rather by means of an application that makes use of the ontology. Depending on what type of application the ontology is used for, a large variety of evaluation criteria can be applied to report, compare and quantify the results. Some of these criteria are listed in Table 2.

Figure 2:

An application-based evaluation does not directly assess an ontology, but rather evaluates an application that utilizes an ontology for its operations.

One major type of application in a research setting is to facilitate the integrated analysis of scientific data. In such a scenario, the ‘application’ that is used to evaluate an ontology is an integrated scientific analysis (see Figure 3). Evaluation criteria in such a scenario follow the established criteria in the scientific domain and range from comparisons with a gold standard to experimental validation.
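Where an integrated analysis produces a ranking, for instance candidate genes scored for a disease and compared against known gene-disease associations, ROC analysis is one way to carry out the comparison with a gold standard. The sketch below is a generic illustration, not the procedure of any cited study: `roc_auc`, the scores and the labels are all hypothetical. It computes the area under the ROC curve as the probability that a randomly chosen true association is scored above a randomly chosen non-association, with ties counted as one half.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical similarity scores for six candidate genes, and whether each
# gene is associated with the disease in the gold standard (1) or not (0).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
in_gold_standard = [1, 0, 1, 0, 0, 0]

auc = roc_auc(scores, in_gold_standard)
print(auc)
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every known association above every non-association, so two ontology versions can be compared by running the same analysis with each and reporting both values.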

Figure 3:

An analysis-based evaluation performs a scientific data analysis that relies on an ontology and evaluates the success of the analysis using criteria established in the scientific domain.

These three strategies for evaluating research in ontology are complementary and ensure different aspects of an ontology’s quality. Peer review can assure social criteria as well as adherence to scientific reporting standards, application-based evaluation ensures that ontologies can be used efficiently, and evaluation using a scientific analysis ensures that ontologies lead to verifiable novel insights in science. Adopting this combined evaluation methodology shifts the focus in ontology research from building better ontologies towards systematically improving the ontologies with regard to ontology-based applications and ontology-based scientific analyses, and thereby paves the way for the critical role that ontologies will continue to play in the future. In particular, one area that will benefit from integrated ontology-based data analysis methods is experiment design [113]. In experiment design, ontologies can be used to relate experimental assays to the biological phenomena that the assays record, so that experiments can be designed to test specific hypotheses about the scientific domain [114]. Furthermore, ontologies are now being used to annotate large data sets, including those originating from high-throughput technologies in all areas of biology, and it is a major challenge of biology to synthesize the available information into an understanding of whole organisms and their interactions with the environment as well as to transform this information into knowledge that can benefit human health. Ontologies will play a crucial role in this integration process because they provide the means to integrate data not only within domains, but also across domains, across species and across levels of granularity.
For example, personalizing the treatment of disease based on the background of the individual patient requires integration of large amounts of data across domains, including information about genetic variations and their associations with phenotypes and drug response [115], genomic, transcriptomic, proteomic, metabolomic, and autoantibody information [116, 117], environmental factors [118], and the patient’s medical history [119]. Another area in which ontologies will increasingly be applied is bridging the gap between basic research results and clinical applications. In recent years, several pioneering studies used ontologies as a means to understand, diagnose and find treatment strategies for human diseases [68, 120–122]. Again, it is the potential of ontologies to connect data from different scientific domains and disciplines on a large scale that has enabled such analyses, and it is one of the most promising future applications of ontologies in biomedicine. One of the great challenges in using ontologies to facilitate integrative, translational biomedical analyses is to connect ontologies that cover basic research domains, such as the GO [11], with medical ontologies, such as SNOMED CT [123] and the repository of ontologies that are within the Unified Medical Language System (UMLS) [124]. The medical ontologies provide access to data in health care and medical knowledge while the biological ontologies enable access to findings from basic research in biology, and the integration of both types of ontologies has the potential to enable analyses that connect basic research with clinical applications and support the personalization of medical treatment. A strategy for evaluating the ontologies involved in such a task, as well as assessing the ontology-based integration results, is a crucial step towards this goal.

CONCLUSIONS

Research results in biomedical ontology should always be evaluated against a biomedical task for which the ontologies are intended. Whether the research result is an ontology, an ontology design pattern, or a method to formulate biomedical phenomena, the benefit ontologies can bring cannot be evaluated based on the ontology alone; instead, any evaluation criteria must evaluate the whole system consisting of the ontology and the tasks to which it is applied. Many ontology-based applications are amenable to quantitative evaluation criteria. Quantitative measures enable the objective comparison of research results and play a crucial role in their evaluation. These quantitative measures can be adopted in addition to already established qualitative evaluation criteria, and they can also serve to justify and refine existing qualitative measures. Furthermore, with the application of quantitative measures, ontology development methodologies can be evaluated with respect to how well they ensure or improve the performance of research results in particular tasks within a domain. More importantly, objective evaluation criteria for research results are the next step in developing a research methodology for the field of biomedical ontologies. A research methodology based on quantitative evaluation with respect to biomedical applications will improve the ontologies’ utility in data and knowledge integration and thereby increase their potential to improve integrative biology and translational research.

Ontologies are used in biomedicine to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. Biomedical ontologies must be evaluated with respect to the purpose for which they are built. Ontology-based applications can be evaluated quantitatively. Quantitative evaluation can lead to developing a methodology for systematically improving biomedical ontologies.

FUNDING

Funding for RH was provided by the European Commission's 7th Framework Programme, RICORDO project, grant number 248502. Funding for MD was provided by a Natural Sciences and Engineering Research Council of Canada Discovery Grant. Funding for GVG was provided by the National Institutes of Health, grant number R01 HG004838-02. The open access publication was funded by the European Commission's 7th Framework Programme, RICORDO project, grant number 248502.

86 in total

1. Strengths and limitations of formal ontologies in the biomedical domain.

Authors: Stefan Schulz; Holger Stenzhorn; Martin Boeker; Barry Smith
Journal: Rev Electron Comun Inf Inov Saude Date: 2009-03-01

2. Data-driven prediction of drug effects and interactions.

Authors: Nicholas P Tatonetti; Patrick P Ye; Roxana Daneshjou; Russ B Altman
Journal: Sci Transl Med Date: 2012-03-14 Impact factor: 17.956

3. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

4. Applying the functional abnormality ontology pattern to anatomical functions.

Authors: Robert Hoehndorf; Axel-Cyrille Ngonga Ngomo; Janet Kelso
Journal: J Biomed Semantics Date: 2010-03-31

5. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments.

Authors: Helen Parkinson; Ugis Sarkans; Nikolay Kolesnikov; Niran Abeygunawardena; Tony Burdett; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Emma Hastings; Ele Holloway; Natalja Kurbatova; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Gabriella Rustici; Anjan Sharma; Eleanor Williams; Tomasz Adamusiak; Marco Brandizi; Nataliya Sklyar; Alvis Brazma
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

6. The BioPAX community standard for pathway data sharing.

Authors: Emek Demir; Michael P Cary; Suzanne Paley; Ken Fukuda; Christian Lemer; Imre Vastrik; Guanming Wu; Peter D'Eustachio; Carl Schaefer; Joanne Luciano; Frank Schacherer; Irma Martinez-Flores; Zhenjun Hu; Veronica Jimenez-Jacinto; Geeta Joshi-Tope; Kumaran Kandasamy; Alejandra C Lopez-Fuentes; Huaiyu Mi; Elgar Pichler; Igor Rodchenkov; Andrea Splendiani; Sasha Tkachev; Jeremy Zucker; Gopal Gopinath; Harsha Rajasimha; Ranjani Ramakrishnan; Imran Shah; Mustafa Syed; Nadia Anwar; Ozgün Babur; Michael Blinov; Erik Brauner; Dan Corwin; Sylva Donaldson; Frank Gibbons; Robert Goldberg; Peter Hornbeck; Augustin Luna; Peter Murray-Rust; Eric Neumann; Oliver Ruebenacker; Oliver Reubenacker; Matthias Samwald; Martijn van Iersel; Sarala Wimalaratne; Keith Allen; Burk Braun; Michelle Whirl-Carrillo; Kei-Hoi Cheung; Kam Dahlquist; Andrew Finney; Marc Gillespie; Elizabeth Glass; Li Gong; Robin Haw; Michael Honig; Olivier Hubaut; David Kane; Shiva Krupa; Martina Kutmon; Julie Leonard; Debbie Marks; David Merberg; Victoria Petri; Alex Pico; Dean Ravenscroft; Liya Ren; Nigam Shah; Margot Sunshine; Rebecca Tang; Ryan Whaley; Stan Letovksy; Kenneth H Buetow; Andrey Rzhetsky; Vincent Schachter; Bruno S Sobral; Ugur Dogrusoz; Shannon McWeeney; Mirit Aladjem; Ewan Birney; Julio Collado-Vides; Susumu Goto; Michael Hucka; Nicolas Le Novère; Natalia Maltsev; Akhilesh Pandey; Paul Thomas; Edgar Wingender; Peter D Karp; Chris Sander; Gary D Bader
Journal: Nat Biotechnol Date: 2010-09-09 Impact factor: 54.908

7. Unintended consequences of existential quantifications in biomedical ontologies.

Authors: Martin Boeker; Ilinca Tudose; Janna Hastings; Daniel Schober; Stefan Schulz
Journal: BMC Bioinformatics Date: 2011-11-24 Impact factor: 3.169

8. Automating generation of textual class definitions from OWL to English.

Authors: Robert Stevens; James Malone; Sandra Williams; Richard Power; Allan Third
Journal: J Biomed Semantics Date: 2011-05-17

9. FlyBase: enhancing Drosophila Gene Ontology annotations.

Authors: Susan Tweedie; Michael Ashburner; Kathleen Falls; Paul Leyland; Peter McQuilton; Steven Marygold; Gillian Millburn; David Osumi-Sutherland; Andrew Schroeder; Ruth Seal; Haiyan Zhang
Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971

10. Building a cell and anatomy ontology of Caenorhabditis elegans.

Authors: Raymond Y N Lee; Paul W Sternberg
Journal: Comp Funct Genomics Date: 2003
