
ezTag: tagging biomedical concepts via interactive learning.

Dongseop Kwon¹, Sun Kim², Chih-Hsuan Wei², Robert Leaman², Zhiyong Lu².

Abstract

Recently, advanced text-mining techniques have been shown to speed up manual data curation by providing human annotators with automated pre-annotations generated by rules or machine learning models. Due to the limited training data available, however, current annotation systems primarily focus only on common concept types such as genes or diseases. To support annotating a wide variety of biological concepts with or without pre-existing training data, we developed ezTag, a web-based annotation tool that allows curators to perform annotation and provide training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central. It also provides lexicon-based concept tagging as well as the state-of-the-art pre-trained taggers such as TaggerOne, GNormPlus and tmVar. ezTag is freely available at http://eztag.bioqrator.org.

Year: 2018 PMID: 29788413 PMCID: PMC6030907 DOI: 10.1093/nar/gky428

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Efficient access to information contained in the biomedical literature plays several key roles in experiments, from the early stages of planning to the final interpretation of the results. This biological knowledge can be obtained effectively from expert-curated databases such as UniProt (1). However, the increasing number of new publications makes manual curation increasingly costly. Since manual curation alone is not sufficient to keep biological databases up to date (2), computer-assisted curation by text mining techniques has gained popularity in recent years (3,4). While there are numerous web-based annotation tools available (5–12), they are mostly task-specific, tuned on certain gold standard sets and/or knowledge bases (13). Due to the limited resources available (14–17), computer-assisted annotation has normally focused on common biological concepts such as gene/protein, chemical and disease names. Certain annotation tools (7,8) support more entity types; however, they are generally rule-based, i.e. concept types are assigned by lexical pattern matching. Moreover, only a few studies suggest the idea of adaptive bio-entity annotation via interactive learning (12,18). Another critical issue when developing an annotation tool is whether it supports full-text articles. Even though biocurators read full-text articles as well as abstracts for manual curation, full-text articles have not been well supported by existing annotation tools (13). This is partially due to the difficulty of parsing various XML formats, as well as complex copyright issues for certain journals. For example, the most common user requests for PubTator (11), a widely used annotation tool for biomedical concepts we introduced in 2013, have been to support PubMed Central (PMC) full-text articles and to provide more flexibility for text-mined annotation. The latter is essential for some users because annotation guidelines may differ even for common bio-entity types.

To address these problems, we introduce ezTag, a user-friendly annotation tool that allows biocurators to perform annotation and provide training data interactively. Compared to other bio-entity annotation tools, ezTag has several unique features. First, ezTag supports all PubMed abstracts and PMC open access articles. We achieved this by standardizing the text from both repositories into BioC format (19); it also supports any other document in BioC format. Second, ezTag users have multiple ways of annotating bio-entities: (i) the pre-trained state-of-the-art bio-entity taggers (20–22), (ii) the string pattern match tagger, which uses a user-provided lexicon and (iii) the customized tagger obtained by training TaggerOne (20). Third, ezTag explicitly supports training and annotating text iteratively, so it helps to efficiently produce a set of annotated documents and a customized tagging module for any bio-entity type. Other features include a user-friendly interface based on PubTator user feedback, automatic session ID-based login (i.e. no manual login is required) and RESTful API support for customized tagging modules. As a result, biocurators can annotate documents without much help from software developers. Also, software developers without text mining experience can benefit from our RESTful APIs. Throughout this paper, concept tagging is used to describe bio-entity annotation and it may or may not include assigning concept IDs (i.e. normalization or grounding).

SYSTEM DESCRIPTION

Figure 1 illustrates the system overview of ezTag. As shown in the figure, ezTag utilizes multiple resources to provide bio-entity annotation in biomedical text. For input, any documents in the BioC format (http://bioc.sourceforge.net) (19) can be uploaded to the interface. We chose BioC for input and output for better data interoperability. PubMed abstracts and PMC full-text articles are pre-processed in BioC and ready for upload using PubMed and PMC IDs. These BioC documents are also accessible through RESTful APIs (https://www.ncbi.nlm.nih.gov/research/bionlp/APIs), thus users can always process and share the same BioC documents. Lexicons (Figure 1) are used in two different scenarios. One is for the string match-based tagger, and the other is to assign concept IDs for the machine learning-based tagger (TaggerOne in the figure).

Figure 1.

System overview. ezTag connects multiple resources to provide efficient and effective biological concept tagging. Input and output documents are handled using the BioC format, and user-provided lexicons are used for string match and machine learning-based taggers. The options for automatic concept tagging in text are (i) the string match-based tagger using a lexicon, (ii) the machine learning-based tagger using TaggerOne for customized tagging modules and (iii) the pre-trained taggers.

The core functionalities of ezTag are manual annotation and automatic annotation. ezTag has three modules for automatic annotation: the string match-based tagger, the machine learning-based tagger and pre-trained taggers.

The string match-based tagger uses a user-provided lexicon for identifying bio-entities and assigning concept IDs (i.e. normalization). Since this step may be used as a starting point for interactive learning (which will be explained later), we implemented a trie structure (23) for strict string match but also allowed small variations such as abbreviations, Greek letters, upper/lowercases, hyphens and other stopwords.

The machine learning-based tagger is TaggerOne (20), a semi-Markov model for joint named entity recognition and normalization. Users can train TaggerOne on a set of annotated documents, then use the trained model to tag concepts in a new set of documents. Providing a lexicon is optional; however, if one is provided by the user, then TaggerOne also learns to assign concept IDs.

In addition to the customizable modules (the string match and machine learning-based taggers), ezTag provides annotations from pre-trained taggers. The common bio-entities we support here are chemical, disease, gene/protein, organism/species and sequence variations. We utilize three state-of-the-art tools for those common types. The pre-trained TaggerOne (20) is used for annotating chemical and disease names. GNormPlus (21) is used for annotating gene/protein and organism/species. tmVar (22) is used for sequence variations. Table 1 lists all pre-trained concept taggers used in ezTag and the bio-entities and nomenclatures they use. The last column of the table also shows the normalization performance (F1 scores) of each concept tagger based on the gold standard sets reported in (20–22,24). Note that the F1 performance at the mention level (i.e. identifying bio-entities only) is typically higher than that at the normalization level.

Table 1.

Pre-trained concept tagging tools used in ezTag

Pre-trained tagger	Bio-entity	Nomenclature	F1 score (normalization)
TaggerOne	Chemical	MeSH	0.895
TaggerOne	Disease	MEDIC	0.807
GNormPlus	Gene	NCBI Gene	0.867
GNormPlus	Species	NCBI Taxonomy	0.854
tmVar	Sequence variation	NCBI dbSNP	0.903

MEDIC is a disease vocabulary created by the Comparative Toxicogenomics Database. All other vocabularies are products of the National Library of Medicine. F1 scores are taken from their corresponding publications.
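To illustrate why mention-level F1 is typically at least as high as normalization-level F1, the following minimal Python sketch scores the same toy predictions both ways. It assumes exact span matching and treats a normalized annotation as correct only if both the span and the concept ID match; the cited evaluations may define their matching criteria differently, and the annotations and IDs below are made up for illustration.

```python
def f1_score(gold, predicted):
    """Return the F1 score between two collections of hashable items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (document ID, start, end, concept ID).
gold = [("d1", 0, 7, "ID:1"), ("d1", 20, 28, "ID:2"), ("d2", 5, 12, "ID:3")]
pred = [("d1", 0, 7, "ID:1"), ("d1", 20, 28, "ID:9"), ("d2", 5, 12, "ID:3")]

# Mention level: only the span has to match, so the concept ID is dropped.
mention_f1 = f1_score([a[:3] for a in gold], [a[:3] for a in pred])

# Normalization level (as assumed here): span and concept ID must both match.
normalization_f1 = f1_score(gold, pred)

print(f"mention F1 = {mention_f1:.3f}, normalization F1 = {normalization_f1:.3f}")
```

In this toy case every span is found, but one span receives the wrong ID, so only the normalization score drops.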

Implementation

We developed ezTag using Ruby on Rails and MySQL as a backend database. RESTful APIs were implemented in C++ and Perl. All the web pages in ezTag are HTML5/CSS compatible, thus it supports the latest version of popular web browsers such as Chrome, Safari, Firefox and Internet Explorer. On rare occasions, Internet Explorer may not correctly display some icons due to HTML5 compatibility issues. The source code of the ezTag web interface is available at https://github.com/ncbi-nlp/ezTag.

USAGE

User interface

ezTag was motivated by the feedback from PubTator users and designed to merge useful features of PubTator (11), TaggerOne (20) and BioC Viewer (25). The two primary approaches to providing assisted annotations for concept tagging are string match and machine learning. In ezTag, we support both approaches by implementing a lexicon-based string match tagger and integrating with multiple machine learning-based taggers. ezTag also allows users to choose a training set for a customized tagging module. For a smooth annotation experience, users should first create a collection for an annotation project. An annotation task is then started by uploading documents in BioC format or using PubMed and PMC IDs in the collection. ezTag has top menus for lexicons and customized models (These are called ‘Lexicons’ and ‘Models’ in the web pages, respectively). In this way, lexicons and models can be used to annotate any collection present in the user’s repository. Figure 2 shows a screenshot of the ‘sample training set’ collection page. As described earlier, ezTag has two main functions, automatic annotation and manual annotation ((a) and (c) in the figure, respectively). Users can also create a customized module using the collection to train a model ((b) in the figure).

Figure 2.

ezTag user interface for the sample training set. An annotation project (e.g. the sample training set here) is called a ‘collection’ in ezTag. Uploaded documents belong to a collection and these documents are used for (a) auto annotation (i.e. pre-trained, lexicon or machine learning-based concept tagging), (b) training a machine learning-based tagger (i.e. TaggerOne) and (c) manual annotation.

Input and output

ezTag uses the BioC format for both input and output documents. Annotated or unannotated documents are used as input. The output is a set of documents annotated automatically or manually. If a tagging module was created by training a collection, the customized module is also an output of ezTag.
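As a rough illustration of what a BioC input document looks like, the sketch below builds a minimal BioC XML collection with one passage and one annotation using Python's standard library. The element layout is assumed to follow the BioC XML DTD (collection, document, passage, annotation with infon, location and text); the function name, document ID, source, date and key values are hypothetical, and the exact fields ezTag expects should be checked against its documentation.

```python
import xml.etree.ElementTree as ET


def make_bioc_collection(doc_id, text, annotations):
    """Build a one-passage BioC XML string.

    annotations: list of (start, end, entity_type) character spans in `text`.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"        # placeholder
    ET.SubElement(collection, "date").text = "2018-01-01"       # placeholder
    ET.SubElement(collection, "key").text = "example.key"       # placeholder

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for index, (start, end, entity_type) in enumerate(annotations):
        annotation = ET.SubElement(passage, "annotation", id=str(index))
        ET.SubElement(annotation, "infon", key="type").text = entity_type
        ET.SubElement(annotation, "location",
                      offset=str(start), length=str(end - start))
        ET.SubElement(annotation, "text").text = text[start:end]

    return ET.tostring(collection, encoding="unicode")


sentence = "ezTag tags chemicals such as aspirin."
start = sentence.index("aspirin")
xml_string = make_bioc_collection(
    "example-1", sentence, [(start, start + len("aspirin"), "Chemical")]
)
print(xml_string)
```

An unannotated input document would simply omit the annotation elements.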

Manual annotation

ezTag has a manual annotation tool supporting an arbitrary number of bio-entity types (Figure 3). Manual annotation can be used to add annotations from scratch or to refine existing annotations. For easy browsing in full-text articles, the annotation window has an outline view ((d) in the figure), which appears on the left of the window when the document has multiple sections. Clicking a section in the outline view moves the focus to where the section is in the main text column. Since ezTag aims to be a general annotation tool, concept IDs can only be entered by typing them manually in the current version. The last step of manual annotation is to toggle on the ‘Complete’ button ((c) in the figure), which indicates that the annotation is complete and the document may be used for training. When the ‘Complete’ button is on, the document will be used to train a customized tagging model. When it is off, the document will be used for tagging concepts. Note that, although ezTag does not fully support highlighting overlapping annotations, it keeps and displays all annotations in the annotation table.

Figure 3.

Manual annotation page in ezTag. There are two main windows: (a) main text and (b) annotation table. Users can add an annotation by dragging the mouse over text (a), and tag the annotation by typing ID(s) (b). The complete button (c) is used to mark whether annotation of the document is done. Using this mark, ezTag decides how a document should be used, i.e. either for automatic annotation or for training TaggerOne. A browsable outline appears on the left if a document has multiple sections (d).

Automatic annotation (‘Auto Annotate’)

Using lexicons

When no annotated set or pre-trained tagger is available for the desired entity type, but there is a dictionary of concept names available, this is a good option to start with. Our string match-based tagger uses a user-provided lexicon to tag text. When concept IDs are given in the lexicon, the tagger also assigns the IDs along with annotated text. The user can then review and refine the annotated text as a next step. This also can be followed by training our machine learning-based tagger, TaggerOne, for a new customized tagger.
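The general idea of trie-based lexicon tagging can be sketched as follows. This is not ezTag's implementation: it is a simplified, token-level trie in Python that only normalizes case and hyphens, whereas the tagger described above also handles abbreviations, Greek letters and stopwords. The function names, lexicon entries and concept IDs are hypothetical.

```python
import re

TOKEN = re.compile(r"\w+")  # hyphens and spaces both act as token boundaries


def build_trie(lexicon):
    """lexicon: dict mapping a concept name to a concept ID."""
    root = {}
    for name, concept_id in lexicon.items():
        tokens = [m.group().lower() for m in TOKEN.finditer(name)]
        if not tokens:
            continue
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[None] = concept_id  # None marks the end of a lexicon entry
    return root


def tag(text, trie):
    """Return (start, end, matched text, concept ID) for the longest matches."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in TOKEN.finditer(text)]
    annotations = []
    i = 0
    while i < len(tokens):
        node, best = trie, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j][0])
            if node is None:
                break
            if None in node:
                best = (j, node[None])  # remember the longest match so far
        if best is None:
            i += 1
            continue
        j, concept_id = best
        start, end = tokens[i][1], tokens[j][2]
        annotations.append((start, end, text[start:end], concept_id))
        i = j + 1  # continue after the match, so matches never overlap
    return annotations


lexicon = {"breast cancer": "ID:1", "Cancer": "ID:2", "IL-2": "ID:3"}
trie = build_trie(lexicon)
result = tag("Il 2 levels in Breast-Cancer and cancer.", trie)
for annotation in result:
    print(annotation)
```

Preferring the longest match keeps "Breast-Cancer" as a single mention rather than also tagging the shorter entry "Cancer" inside it.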

Using pre-trained tagging models

This auto annotation option can be used when the user's target bio-entities fall into one of the types that our pre-trained models support. If users have different annotation guidelines in mind, annotated text from this step can be refined and the result can be fed into TaggerOne for a customized tagger.

Using customized tagging models

TaggerOne is a model that jointly predicts mentions and their normalized IDs. Since it does not require specific bio-entity types, one can use TaggerOne to train and predict a wide variety of entity types. After training TaggerOne (‘Train’ in the ezTag interface), one obtains a customized tagger, and this model will be available in the auto annotation menu (as a pre-trained tagging model). As in other auto annotate steps, the output from a customized tagging model for a new set can be used to obtain an improved customized tagger (see ‘Use case’ for more details).

Programmatic access via RESTful API

In addition to a user-friendly web interface, ezTag provides a RESTful API for accessing customized concept taggers. For a customized tagger, users can submit text and get annotations programmatically via the API. A help page on how to use the API is shown on each customized model web page. A customized tagger can be shared with anyone through its RESTful API. RESTful APIs for pre-trained taggers are also available and the instructions can be found at https://www.ncbi.nlm.nih.gov/research/bionlp/APIs.

Session-based automatic logins

ezTag does not require a manual login, but allows users to continue their annotation processes by assigning session IDs. Once a user enters ezTag, a session ID is automatically created (if it does not exist already). A user session is recorded in web cookies, hence normally there is no need to re-login using the session ID. Users can also get a unique URL for each session ID via email to ensure their session remains accessible.

USE CASE: INTERACTIVE LEARNING FOR ADAPTIVE ANNOTATION

ezTag provides multiple ways of annotating text: manual annotation and auto annotation. Auto annotation has three options: string match tagger, machine learning tagger for customized models and pre-trained taggers for common bio-entity types. Because of the flexibility provided by these diverse options, users can perform interactive learning for adaptive entity tagging. Figure 4 presents our interactive learning workflow using ezTag, with the interactive part highlighted in the gray box with the dotted line. As shown in the top left, if annotated training data already exists, ezTag allows the user to train a machine learning tagging model on-the-fly using TaggerOne, that in turn provides pre-annotations for human review. If no training data is initially available, pre-annotations can be obtained either through pre-trained methods or a string matching approach with a user-provided concept lexicon. In either case, when computer pre-annotations are reviewed and refined by the human annotator, they can be iteratively fed into TaggerOne to build a new and improved model that in turn provides higher quality pre-annotations. As a result, for input documents with/without annotations, users obtain a customized concept tagging model and/or higher quality annotated documents as output.

Figure 4.

Interactive learning workflow. New documents are first manually annotated or refined after applying the pre-trained or string match tagger. Interactive learning then iterates over three steps: (i) training TaggerOne using annotated documents, (ii) tagging biological concepts in new documents using the trained model and (iii) refining the documents by correcting annotation mistakes.

To check the effectiveness of interactive learning, we performed ablation tests using the BioCreative V CDR (26) and NCBI Disease (17) datasets for chemicals and diseases, respectively. We assume users annotate 100 documents in each iteration and add them to the existing training set, i.e. starting from 100 documents, the number of annotated documents accumulates after each iteration. The performance is measured by F1 scores on the (independent) test set. This experiment simulates the interactive process and shows how much improvement one would get through the process. We ran the experiments five times for each dataset, and averaged F1 scores in terms of chemical/disease name identification (see Figure 5). The 100 documents added in each iteration were randomly chosen. We observe that the performance after five iterations approaches the upper bound (i.e. the performance obtained when using all training examples). Moreover, a tagger that is imperfect but useful for pre-annotation is likely obtainable from a relatively small number of documents. In practice, the performance realized will depend on the entity type and the quantity and variety of entity mentions that appear in the documents annotated, and should be evaluated periodically as annotation proceeds.
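The shape of this simulation (accumulate a fixed-size batch of randomly chosen documents, retrain, then score on a held-out test set) can be sketched in Python as below. The trainer here is only a stand-in that memorizes annotated mentions; it is not TaggerOne, and the data, function names and parameters are hypothetical, so the resulting numbers say nothing about the reported results.

```python
import random


def simulate_interactive_learning(train_docs, test_docs, train_fn, evaluate_fn,
                                  batch_size=100, iterations=5, seed=0):
    """Return [(number of annotated documents, F1)] after each iteration."""
    pool = list(train_docs)
    random.Random(seed).shuffle(pool)  # documents are added in random order
    annotated, scores = [], []
    for _ in range(iterations):
        batch, pool = pool[:batch_size], pool[batch_size:]
        if not batch:
            break
        annotated.extend(batch)        # the training set accumulates
        model = train_fn(annotated)    # retrain on everything annotated so far
        scores.append((len(annotated), evaluate_fn(model, test_docs)))
    return scores


def train_memorizer(annotated_docs):
    """Stand-in trainer: remember every mention seen in annotated documents."""
    return {mention for doc in annotated_docs for mention in doc}


def evaluate_f1(model, test_docs):
    """F1 of a lookup tagger that predicts exactly the mentions it has seen."""
    gold = [(i, m) for i, doc in enumerate(test_docs) for m in doc]
    predicted = [(i, m) for i, m in gold if m in model]
    if not gold or not predicted:
        return 0.0
    # Every prediction is correct here, so TP equals len(predicted).
    return 2 * len(predicted) / (len(gold) + len(predicted))


# Toy corpus: each document is a set of three mentions from a small vocabulary.
rng = random.Random(42)
vocabulary = [f"entity{i}" for i in range(200)]


def make_doc():
    return frozenset(rng.sample(vocabulary, 3))


train_docs = [make_doc() for _ in range(500)]
test_docs = [make_doc() for _ in range(50)]

scores = simulate_interactive_learning(train_docs, test_docs,
                                       train_memorizer, evaluate_f1)
for size, f1 in scores:
    print(f"{size} documents: F1 = {f1:.3f}")
```

Averaging such curves over several random seeds would mirror the repeated runs described above.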

Figure 5.

Performance changes over accumulated training documents for identifying chemical and disease names. The dotted lines indicate the upper bound that TaggerOne can achieve, i.e. when all training documents are used.

CONCLUSION

ezTag is a versatile annotation tool that enables users to perform both simple annotation and complex interactive learning for adaptive concept tagging. With or without pre-existing training data, users can obtain annotated text for a wide variety of bio-entities. This is made possible by combining the user-friendly web interface with a new string match-based tagger and state-of-the-art annotation tools such as TaggerOne, GNormPlus and tmVar. Moreover, the interactive learning framework via ezTag will reduce the annotation burden for new or adaptive concept tagging. While PubTator, our earlier tool, pre-annotates common bio-entities in PubMed abstracts, ezTag annotates any documents, including PubMed abstracts and PMC full-text articles, on the fly. The popular PubTator tool still helps access high quality pre-annotated text, but ezTag will complement PubTator by supporting full-text articles and adaptive annotation for more bio-entities. Currently, we display only text from the body of PMC articles; however, curators often use figures and tables for annotation. In the future, we plan to include figures and tables in BioC and display the graphics in ezTag for a better annotation experience. Another limitation is the lack of support for disjoint annotations and PDF documents, which also remains future work.

DATA AVAILABILITY

ezTag is free and open to all users and there is no login requirement. ezTag can be accessed at http://eztag.bioqrator.org. The source code of the ezTag web interface is also available at https://github.com/ncbi-nlp/ezTag.

24 in total

1. MyMiner: a web application for computer-assisted biocuration and text annotation.

Authors: David Salgado; Martin Krallinger; Marc Depaule; Elodie Drula; Ashish V Tendulkar; Florian Leitner; Alfonso Valencia; Christophe Marcelle
Journal: Bioinformatics Date: 2012-07-12 Impact factor: 6.937

2. Manual curation is not sufficient for annotation of genomic databases.

Authors: William A Baumgartner; K Bretonnel Cohen; Lynne M Fox; George Acquaah-Mensah; Lawrence Hunter
Journal: Bioinformatics Date: 2007-07-01 Impact factor: 6.937

Review 3. A survey on annotation tools for the biomedical literature.

Authors: Mariana Neves; Ulf Leser
Journal: Brief Bioinform Date: 2012-12-18 Impact factor: 11.622

4. BioCreative V CDR task corpus: a resource for chemical disease relation extraction.

Authors: Jiao Li; Yueping Sun; Robin J Johnson; Daniela Sciaky; Chih-Hsuan Wei; Robert Leaman; Allan Peter Davis; Carolyn J Mattingly; Thomas C Wiegers; Zhiyong Lu
Journal: Database (Oxford) Date: 2016-05-09 Impact factor: 3.451

5. BioC viewer: a web-based tool for displaying and merging annotations in BioC.

Authors: Soo-Yong Shin; Sun Kim; W John Wilbur; Dongseop Kwon
Journal: Database (Oxford) Date: 2016-08-10 Impact factor: 3.451

6. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.

Authors: Ayush Singhal; Michael Simmons; Zhiyong Lu
Journal: PLoS Comput Biol Date: 2016-11-30 Impact factor: 4.475

7. UniProt: the universal protein knowledgebase.

Authors:
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

8. On expert curation and scalability: UniProtKB/Swiss-Prot as a case study.

Authors: Sylvain Poux; Cecilia N Arighi; Michele Magrane; Alex Bateman; Chih-Hsuan Wei; Zhiyong Lu; Emmanuel Boutet; Hema Bye-A-Jee; Maria Livia Famiglietti; Bernd Roechert; The UniProt Consortium
Journal: Bioinformatics Date: 2017-11-01 Impact factor: 6.937

9. Assisting manual literature curation for protein-protein interactions using BioQRator.

Authors: Dongseop Kwon; Sun Kim; Soo-Yong Shin; Andrew Chatr-aryamontri; W John Wilbur
Journal: Database (Oxford) Date: 2014-07-22 Impact factor: 3.451

10. Text-mining-assisted biocuration workflows in Argo.

Authors: Rafal Rak; Riza Theresa Batista-Navarro; Andrew Rowley; Jacob Carter; Sophia Ananiadou
Journal: Database (Oxford) Date: 2014-07-18 Impact factor: 3.451
