Jelena Jovanović, Ebrahim Bagheri.
Abstract
The abundance and unstructured nature of biomedical texts, whether clinical or research content, pose significant challenges for the effective and efficient use of the information and knowledge stored in such texts. Annotating biomedical documents with machine-intelligible semantics facilitates advanced, semantics-based text management, curation, indexing, and search. This paper focuses on the annotation of biomedical entity mentions with concepts from relevant biomedical knowledge bases such as UMLS. Through such annotation, the meaning of those mentions is unambiguously and explicitly defined, and thus made readily available for automated processing. This process is widely known as semantic annotation, and the tools that perform it are known as semantic annotators.

Over the last dozen years, the biomedical research community has invested significant effort in the development of biomedical semantic annotation technology. Aiming to establish grounds for further developments in this area, we review a selected set of state-of-the-art biomedical semantic annotators, focusing particularly on general-purpose annotators, that is, semantic annotation tools that can be customized to work with texts from any area of biomedicine. We also examine potential directions for further improvement of today's annotators, which could make them even more capable of meeting the needs of real-world applications.

To motivate and encourage further developments in this area, along the suggested and/or related directions, we review existing and potential practical applications and benefits of semantic annotators.
Keywords: Biomedical ontologies; Biomedical text mining; Natural language processing (NLP); Semantic annotation; Semantic technologies
Year: 2017 PMID: 28938912 PMCID: PMC5610427 DOI: 10.1186/s13326-017-0153-x
Source DB: PubMed Journal: J Biomed Semantics
An overview of ontologies, thesauri and knowledge bases used by biomedical semantic annotation tools discussed in the paper
| Resource | Description |
|---|---|
| BioPortal ( | A major repository of biomedical ontologies, currently hosting over 500 ontologies, controlled vocabularies and terminologies. Its Resource Index provides an ontology-based unified index of and access to multiple heterogeneous biomedical resources (annotated with BioPortal ontologies). |
| DBpedia ( | “Wikipedia for machines”, that is, a huge KB developed through a community effort of extracting information from Wikipedia and representing it in a structured format suitable for automated machine processing. It is the central hub of the Linked Open Data Cloud. |
| LLD - Linked Life Data ( | LLD platform provides access to a huge KB that includes and semantically interlinks knowledge about genes, proteins, molecular interactions, pathways, drugs, diseases, clinical trials and other related types of biomedical entities. It is part of the Linked Open Data Cloud ( |
| NCBI Biosystems Database ( | Repository providing integrated access to structured data and knowledge about biological systems and their components: genes, proteins, and small molecules. |
| OBO - Open Biomedical Ontologies ( | Community of ontology developers devoted to the development of a family of interoperable and scientifically accurate biomedical ontologies. Well known OBO ontologies include: |
| SNOMED CT ( | SNOMED CT is considered the world’s most comprehensive and precise, multilingual health terminology. It is used for the electronic exchange of clinical health information. It consists of concepts, concept descriptions (i.e., several terms that are used to refer to the concept), and concept relationships. |
| UMLS (Unified Medical Language System) Metathesaurus ( | The most well-known and widely used knowledge source in the biomedical domain. It assigns a unique identifier (CUI) to each medical concept and connects concepts to each other thus forming a graph-like structure; each concept (i.e. CUI) is associated with its ‘semantic type’, a broad category such as Gene, Disease or Syndrome; each concept is also associated with several terms used to refer to that concept in biomedical texts; these terms are pulled from nearly 200 biomedical vocabularies. Some well-known vocabularies that have been used by biomedical semantic annotators include: |
| UniProtKB/Swiss-Prot ( | The manually annotated section of UniProtKB, a comprehensive protein sequence KB. Its entries are curated by biologists, regularly updated and cross-linked to numerous external databases, with the ultimate objective of providing all known relevant information about a particular protein. |
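
The Metathesaurus organization described in the table (a unique identifier per concept, one semantic type, several terms, and links to other concepts) can be pictured with a small data structure. The following is a minimal sketch under stated assumptions, not the actual UMLS distribution format: the class name `Concept`, the helper `build_term_index`, and the identifiers `CUI-0001` to `CUI-0003` are made up for illustration and are not real CUIs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Concept:
    cui: str                 # unique concept identifier (placeholder values below)
    semantic_type: str       # broad category, e.g. "Disease or Syndrome"
    terms: set[str] = field(default_factory=set)    # surface forms from source vocabularies
    related: set[str] = field(default_factory=set)  # identifiers of linked concepts (graph edges)


def build_term_index(concepts):
    """Map each lower-cased term to the set of concept identifiers it may denote."""
    index = defaultdict(set)
    for concept in concepts:
        for term in concept.terms:
            index[term.lower()].add(concept.cui)
    return index


# Toy data: identifiers are hypothetical, chosen only to show the structure.
concepts = [
    Concept("CUI-0001", "Disease or Syndrome", {"common cold", "cold"}, {"CUI-0003"}),
    Concept("CUI-0002", "Natural Phenomenon or Process", {"cold temperature", "cold"}),
    Concept("CUI-0003", "Sign or Symptom", {"cough"}),
]

index = build_term_index(concepts)
ambiguous = sorted(index["cold"])  # a term shared by two concepts needs disambiguation
```

A term that maps to more than one identifier, such as `cold` in this toy data, is exactly the case in which an annotator needs some form of disambiguation.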
Example application cases of biomedical semantic annotation tools
| Application Case (AC) | The role of semantic annotation tool in the AC | Biomedical resources relevant for the AC (or representative examples, if multiple) |
|---|---|---|
| Semantic search of biomedical tools and services [ | Semantic search of biomedical tools and services enabled by semantic annotation of users’ (free-form) queries with concepts from UMLS Metathesaurus | Catalogs of and social spaces created around biomedical tools and services, e.g.: |
| Semantic search of domain specific scientific literature [ | Semantic annotation of PubMed entries with ontological concepts related to genes and proteins | Ontologies used for the annotation of biomedical references (PubMed entries): |
| Improved clinical decision making [ | Extraction of key clinical concepts (UMLS-based) required for supporting clinical decision making; the concepts are extracted from biomedical literature and clinical text sources | Sources of biomedical texts used to support decision making: |
| Unambiguous description of abbreviations [ | Extended (long) forms of abbreviations are matched against both UMLS and DBpedia concepts, thus not only disambiguating the long forms, but also connecting UMLS and DBpedia KBs | Allie - a search service for abbreviations and their long forms ( |
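
The last application case, matching the long form of an abbreviation against two knowledge bases and thereby linking them, can be sketched as a normalized lookup in two term dictionaries. This is only an illustration of the idea, not the method used by the cited service: the functions `normalize` and `link_long_form`, the dictionary contents, and the identifier `CUI-0101` are hypothetical.

```python
import re


def normalize(text):
    """Lower-case and collapse punctuation/whitespace so that variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def link_long_form(long_form, umls_terms, dbpedia_labels):
    """Look up one long form in both KBs; a hit in both yields a cross-KB link."""
    key = normalize(long_form)
    return {
        "long_form": long_form,
        "umls": umls_terms.get(key),        # None when the KB has no matching term
        "dbpedia": dbpedia_labels.get(key),
    }


# Toy dictionaries keyed by normalized term; the UMLS-side identifier is a placeholder.
umls_terms = {"magnetic resonance imaging": "CUI-0101"}
dbpedia_labels = {"magnetic resonance imaging": "dbr:Magnetic_resonance_imaging"}

link = link_long_form("Magnetic-Resonance Imaging", umls_terms, dbpedia_labels)
unmatched = link_long_form("mean reciprocal index", umls_terms, dbpedia_labels)
```

When both lookups succeed, the pair of identifiers in `link` acts as a bridge between the two knowledge bases for that long form.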
General purpose biomedical semantic annotation tools (Part I)
| | cTAKES [ | NOBLE Coder [ | MetaMap [ | NCBO annotator [ |
|---|---|---|---|---|
| Modularity/configuration options | Modular text processing pipeline | Vocabulary (terminology); | Text processing pipeline; | Vocabulary (terminology); |
| Disambiguation of terms | Enabled through integration of the YTEX component [ | Does not use word sense disambiguation (WSD); instead uses heuristics to choose one concept among the candidate concepts for the same piece of input text | Supported; based on: | Not supported |
| Vocabulary (terminology) | Subset of UMLS, namely SNOMED CT and RxNORM | Several pre-built vocabularies, based on subsets of UMLS | UMLS Metathesaurus | UMLS Metathesaurus and BioPortal ontologies (over 330 ontologies) |
| Speed* | Suitable for real-time processing | Suitable for real-time processing | Better for off-line batch processing | Suitable for real-time processing |
| Implementation form | Software (Java) library; | Software (Java) library; | Software library; | RESTful Web service |
| Availability | open source; | open source; | open source; | closed source, but freely available |
| Specific features | Better performance on clinical texts than on biomedical scientific literature (its NLP components are trained on clinical texts) | Offers user interface for creating custom terminologies (to be used for annotation) by selecting and merging elements from several different thesauri/ontologies | Primarily developed for annotation of biomedical literature (MEDLINE/PubMed citations); performs better on this kind of text than clinical notes | It uses MGrep term-to-concept matching tool to get primary set of annotations; these are then extended using different forms of ontology-based semantic matching |
*Note that speed estimates are based on the experimental results reported in the literature; those experiments were done with corpora of up to 200 documents (paper abstracts or clinical notes); the given estimates might not hold for significantly larger corpora
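
The table mentions heuristics that pick one concept among several candidates for the same piece of text. The source does not specify those heuristics, so the sketch below uses two assumed rules purely for illustration: prefer candidates whose semantic type is on a user-supplied list, then prefer candidates for which the matched text is the concept's preferred name. The names `Candidate` and `choose_concept` and all identifiers are hypothetical and do not describe any particular tool's behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    concept_id: str
    semantic_type: str
    is_preferred_term: bool  # True if the matched text is the concept's preferred name


def choose_concept(candidates, preferred_types):
    """Return a single candidate using simple ranking rules (assumed, not tool-specific)."""
    def rank(candidate):
        return (
            candidate.semantic_type not in preferred_types,  # wanted types sort first
            not candidate.is_preferred_term,                 # then preferred names
            candidate.concept_id,                            # deterministic tie-break
        )
    return min(candidates, key=rank) if candidates else None


# Candidates for the mention "cold" in a clinical note (toy data).
candidates = [
    Candidate("CUI-0002", "Natural Phenomenon or Process", True),
    Candidate("CUI-0001", "Disease or Syndrome", False),
]
chosen = choose_concept(candidates, preferred_types={"Disease or Syndrome", "Sign or Symptom"})
```

Unlike word sense disambiguation, such rules ignore the surrounding context of the mention, which keeps them fast but makes them sensitive to how the preferred types are configured.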
General purpose biomedical semantic annotation tools (Part II)
| | BeCAS [ | Whatizit [ | ConceptMapper [ | Neji [ |
|---|---|---|---|---|
| Modularity/configuration options | Semantic types (i.e. types of entities to annotate) | Pre-built pipelines for several biomedical types (see Specific features) | Text processing pipeline; | Modular text processing pipeline |
| Disambiguation of terms | No information available | Not supported | Not supported | Does not use WSD; instead applies a set of heuristic rules to identify and remove annotations of lower importance |
| Vocabulary (terminology) | Custom built vocabulary by using concepts from multiple sources, such as UMLS, NCBI BioSystems, ChEBI, and the Gene Ontology. | The use of the vocabulary depends on the type of entity a pipeline is specialized for (e.g. NCBI KB for species, or Gene Ontology for genes) | General purpose dictionary lookup tool, not tied to any vocabulary | Not tied to any particular vocabulary |
| Speed* | Suitable for real-time processing | Suitable for real-time processing | Suitable for real-time processing | Suitable for real-time processing |
| Implementation form | Software (Python) library; | SOAP Web service | Software (Java) library; part of the UIMA NLP framework [ | RESTful Web service |
| Availability | open source; | closed source, but | open source; | open source; |
| Specific features | Primarily aimed at annotation of biomedical research papers; focused on annotation of several (11) types of biomedical entities, including species, microRNAs, enzymes, chemicals, drugs, diseases, etc. | Offers several pre-built pipelines for specific entity types; e.g. whatizitGO identifies proteins based on the Gene Ontology (GO), while whatizitChemical annotates chemical entities based on ChEBI | Not specifically developed for the biomedical domain, but is a general purpose dictionary lookup tool | Includes modules for both ML and dictionary-based annotation; can automatically combine annotations generated by different modules |
*Note that speed estimates are based on the experimental results reported in the literature; those experiments were done with corpora of up to 200 documents (paper abstracts or clinical notes); the given estimates might not hold for significantly larger corpora
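
Two ideas from the table, vocabulary-independent dictionary lookup and the removal of lower-importance annotations when outputs of several modules are combined, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation of any listed tool: `annotate` does greedy longest-match lookup over word tokens, `drop_nested` treats an annotation contained in a longer one as "lower importance", and the dictionary identifiers are hypothetical.

```python
import re


def annotate(text, dictionary, max_words=4):
    """Greedy longest-match lookup; returns (start, end, concept_id) tuples."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest phrase starting at token i first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if phrase in dictionary:
                annotations.append((tokens[i][1], tokens[i + n - 1][2], dictionary[phrase]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return annotations


def drop_nested(annotations):
    """Keep only annotations that are not fully contained in an already kept, longer one."""
    kept = []
    for start, end, concept_id in sorted(annotations, key=lambda a: (a[0], -a[1])):
        if any(ks <= start and end <= ke for ks, ke, _ in kept):
            continue
        kept.append((start, end, concept_id))
    return kept


# Toy dictionary: any vocabulary can be plugged in; identifiers are placeholders.
dictionary = {"breast cancer": "ID-1", "cancer": "ID-2"}
text = "Patients with breast cancer received cancer therapy."
found = annotate(text, dictionary)

# Combining with a second (hypothetical) module that also tagged the nested "cancer".
combined = drop_nested(found + [(21, 27, "ID-2")])
```

Because the lookup only depends on the dictionary passed in, the same code works with terms drawn from any vocabulary; the nesting rule is one possible choice among many for ranking annotation importance.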
Corpora used for evaluation of biomedical semantic annotators. The table includes corpora that were used in the reported use cases (“Benefits and Use Cases” section, Table 2), and/or benchmarking of the discussed tools ("Summary of benchmarking results" and "Entity-specific biomedical annotation tools" sections)
| Corpus | Description |
|---|---|
| AnEM - Anatomical Entity Mention [ | The corpus consists of 500 documents selected randomly from citation abstracts and full-text biomedical research papers (from PubMed); it is manually annotated (over 3000 annotations) with anatomical entities. The corpus is available under the open CC-BY-SA license. |
| BC4GO [ | The corpus, developed for the BioCreative IV shared task, consists of 200 articles (over 5000 text passages) from Model Organism Databases; these articles were manually annotated with more than 1356 distinct GO terms. In addition to the core elements of GO annotations - a gene or gene product, a GO term, and a GO evidence code - the corpus also includes the GO evidence text. |
| CALBC - Collaborative Annotation of a Large Biomedical Corpus [ | A very large, publicly shared corpus of Medline abstracts automatically annotated with biomedical entities; the small corpus comprises ~175 K abstracts, whereas the big one consists of more than 714 K abstracts; since the annotations were not made by humans but by several annotation systems (and then aggregated), it is referred to as a “silver standard”. |
| Chemical Disease Relation (CDR) [ | The corpus, developed for the BioCreative V shared task, consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions. MeSH is used as the controlled vocabulary. |
| CRAFT - the Colorado Richly Annotated Full Text corpus [ | Publicly available, human annotated (gold standard) corpus of full-text biomedical journal articles; it consists of 67 documents and 87,674 human annotations. |
| GENETAG [ | Publicly available corpus of 20 K Medline sentences manually annotated with gene/protein names. Part of the corpus (15 K sentences) was used for the BioCreative I challenge (Gene Mention Identification task), and the rest (5 K sentences) was used as test data for BioCreative II competition (Gene Mention Tagging Task). URL: |
| GENIA [ | Open-access, manually annotated corpus consisting of 2000 Medline abstracts (400,000+ words) with almost 100,000 annotations for biological terms. Terms are annotated with concepts from the GENIA ontology, a formal model of cell signaling reactions in humans (the ontology is provided together with the corpus). |
| 2010 i2b2/VA corpus [ | The corpus consists of manually annotated de-identified clinical records (discharge summaries and progress reports) from three medical centers. It was originally created for the 2010 i2b2/VA NLP challenge to support 3 kinds of tasks: extraction of medical concepts from patient reports; assigning assertion types to medical problem concepts; and determining the type of relation between medical problems, tests, and treatments. The corpus consists of 394 annotated training reports, 477 annotated test reports, and 877 unannotated reports. |
| JNLPBA [ | A publicly available, manually annotated corpus originally created for the Bio-Entity Recognition Task at BioNLP/NLPBA 2004. The training set consists of 2000 Medline abstracts extracted from the GENIA Version 3.02 corpus; the data set is annotated with five entity types: Protein, DNA, RNA, Cell_line, and Cell_type. The test set consists of 404 annotated Medline abstracts, also from the GENIA project; half of this data set is from the same domain as the training data, whereas the other half is from the super domain of blood cells and transcription factors. |
| NCBI Disease corpus [ | Publicly available, manually annotated corpus of 793 PubMed abstracts; 6892 disease mentions are annotated with concepts from Medical Subject Headings (MeSH) and Online Mendelian Inheritance in Man (OMIM) vocabularies. |
| Mantra Gold Standard Corpus [ | Publicly available multilingual gold-standard corpus for biomedical concept recognition. It includes text from different types of parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. It contains 5530 annotations based on a subset of UMLS that covers a wide range of semantic groups. |
| ShARe - Shared Annotated Resources [ | Gold standard corpus of de-identified clinical free-text notes; it includes 199 documents and 4211 human annotations; originally prepared for the ShARe/CLEF eHealth Evaluation Lab focused on NLP and information retrieval tasks for clinical care. |
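
Gold standard corpora such as those listed above are typically used to score an annotator's output, and a silver standard is obtained by aggregating the output of several systems. The sketch below illustrates both under simplifying assumptions: annotations are `(document, start, end, concept)` tuples, only exact matches count as correct, and aggregation is a plain vote with a hypothetical `min_votes` threshold. The function names and all data are illustrative and do not reproduce the procedure used for any specific corpus.

```python
from collections import Counter


def evaluate(predicted, gold):
    """Exact-match precision, recall and F1 over sets of annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def silver_standard(system_outputs, min_votes=2):
    """Keep annotations proposed by at least `min_votes` systems."""
    votes = Counter()
    for output in system_outputs:
        votes.update(set(output))  # each system votes at most once per annotation
    return {annotation for annotation, count in votes.items() if count >= min_votes}


# Toy data: (document id, start offset, end offset, concept id).
gold = {("d1", 0, 5, "X"), ("d1", 10, 15, "Y"), ("d2", 3, 9, "Z"), ("d2", 20, 24, "X")}
system_a = {("d1", 0, 5, "X"), ("d1", 10, 15, "Y"), ("d2", 30, 35, "Q")}
system_b = {("d1", 0, 5, "X"), ("d2", 30, 35, "Q")}
system_c = {("d1", 10, 15, "Y"), ("d2", 3, 9, "Z")}

silver = silver_standard([system_a, system_b, system_c], min_votes=2)
precision, recall, f1 = evaluate(silver, gold)
```

In this toy example the voted set contains one annotation that is absent from the gold standard, which shows how agreement among systems can still propagate a shared error into a silver standard.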