PubTator: a web-based text mining tool for assisting biocuration.

Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu.

Abstract

Manually curating knowledge from biomedical literature into structured databases is highly expensive and time-consuming, making it difficult to keep pace with the rapid growth of the literature. There is therefore a pressing need to assist biocuration with automated text mining tools. Here, we describe PubTator, a web-based system for assisting biocuration. PubTator is different from the few existing tools by featuring a PubMed-like interface, which many biocurators find familiar, and being equipped with multiple challenge-winning text mining algorithms to ensure the quality of its automatic results. Through a formal evaluation with two external user groups, PubTator was shown to be capable of improving both the efficiency and accuracy of manual curation. PubTator is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/.

Entities: Chemical Disease Gene Species

Year: 2013 PMID: 23703206 PMCID: PMC3692066 DOI: 10.1093/nar/gkt441

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Current biomedical research has become heavily dependent on online access to knowledge in expert-curated biological databases. Manual curation is often required to build these knowledge bases, which involves biocurators reading articles, extracting key findings and cross-referencing data. Biocuration has become an essential part of biological discovery and biomedical research (1–3). However, as the volume of biological literature grows rapidly, it becomes increasingly difficult for biocurators to keep pace with the literature because manual biocuration is a highly expensive and time-consuming endeavour. To help ease the burden of manual curation, there have been increasing efforts to use automatic text-mining techniques (4–12), including finding gene names and symbols, prioritizing documents for curation and assigning ontology concepts. In response to a call for participation in the BioCreative 2012 Interactive Text Mining task (13), we developed PubTator, a web-based application that provides computer assistance to biocurators (14). PubTator has several unique features that distinguish it from existing annotation and literature search tools (15–17), as it is designed specifically for the needs of biocurators who have limited text-mining experience. First, PubTator is a web-based system; thus, no installation is required and it is not restricted to any specific computer platform. Second, PubTator is an all-in-one system that provides one-stop service for literature curation, from searching and retrieving relevant articles to annotating selected articles. As such, user input can be either a search query or a list of PubMed articles. When manual curation is completed, users can readily download and export their annotations for database integration. Third, PubTator has a PubMed-like interface, which many biocurators find familiar and easy to use with minimal training. Fourth, multiple competition-winning text-mining approaches have been integrated into PubTator for automatically identifying key biological entities (18,19). Hence, it provides state-of-the-art performance in generating automatic pre-annotations for computer-assisted biocuration. Finally, PubTator is adaptable to different annotation tasks and also allows its users to personalize their own annotation environment.

SYSTEM DESCRIPTION

Pre-annotating PubMed articles using text-mining tools

PubTator houses the entire content of PubMed and keeps current with nightly updates. To enable entity-specific semantic searches and provide pre-annotations for computer-assisted biocuration, automatic text-mining tools are applied to all articles with respect to genes, diseases, species, chemicals and mutations. More specifically, we not only find the occurrences of those entities in text but also map all entity mentions to standard database or controlled vocabulary identifiers, as shown in Table 1. To ensure the high quality of automatically processed results, we used tools that have been extensively evaluated and shown strong performance in various text-mining competitions. Our entity recognition tools include GeneTUKit (19) for gene mention, GenNorm (18) for gene normalization, SR4GN (20) for species, DNorm (Leaman et al., 2013, under consideration; http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm/) for diseases, tmVar (21) for mutations and a dictionary-based lookup approach (8) for chemicals. SR4GN was also used for associating recognized species with their corresponding gene/protein mentions so that we were able to perform cross-species gene normalization in PubTator.

Table 1.

Text-mining tools used for pre-annotating bio-entities in PubMed articles

Bio-entity	Text-mining tool	Nomenclature	F₁ score (%)
Gene (mention)	GeneTUKit	N/A	82.97
Gene (normalization)	GenNorm	NCBI Gene	92.89
Disease	DNorm	MEDIC	80.90
Species	SR4GN	NCBI Taxonomy	85.42
Chemical	A dictionary-based lookup approach	MeSH	53.82
Mutation	tmVar	NCBI dbSNP (rs#) or tmVar normalized forms	93.98

The reported F1 scores (http://en.wikipedia.org/wiki/F1_score) of different tools were either taken from their corresponding publications or assessed by us on public benchmarking datasets. MEDIC is a disease vocabulary created by the Comparative Toxicogenomics Database. All other vocabularies are products of the National Library of Medicine. Separate tools are used for identifying gene names in abstracts (mention) and assigning NCBI Gene identifiers to those mentions (normalization).
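For reference, the F1 score reported in Table 1 is the harmonic mean of precision and recall. The following is a minimal sketch of that computation; the function name and the example counts are illustrative and are not taken from any of the evaluated tools.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # everything the tool annotated
    actual = true_positives + false_negatives     # everything in the gold standard
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 correct mentions, 20 spurious mentions, 20 missed mentions.
print(f"F1 = {100 * f1_score(80, 20, 20):.2f}%")
```

With equal precision and recall, as in this hypothetical case, the F1 score equals both of them.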

Search function in PubTator

PubTator supports both keyword searches and semantic searches with respect to specific bio-entities. As shown in Figure 1, five search options are currently available in PubTator:

PubMed: return results identical to PubMed search results
Gene: return all articles relevant to a specific gene or gene product
Chemical: return all articles relevant to a specific chemical
Disease: return all articles relevant to a specific disease or syndrome
PMID List: return articles in the PubMed Identifier (PMID) upload order

Figure 1.

The PubTator homepage with five different search options.

The first search option (PubMed) is implemented using the NCBI’s Entrez Programming Utilities Web service API (http://www.ncbi.nlm.nih.gov/books/NBK25500/). The next three semantic search options are based on pre-computed results of the different text-mining tools as shown in Table 1. As biological entities are often associated with multiple names, our semantic search feature allows users to retrieve all the articles relevant to an entity without having to enumerate the entire set of possible aliases (22). For instance, searching for the breast cancer gene ERBB2 will also retrieve articles containing only its alternative names such as HER2 (e.g. see Result 2 in Figure 2). The last search option (PMID List) is provided for users who already have a list of relevant articles for curation.
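The alias-independent retrieval described above can be illustrated with a concept-level index: mentions are normalized to an identifier once, and queries are resolved to the same identifier before lookup. This is only a sketch under stated assumptions, not PubTator's actual implementation; the alias table, the PMIDs and the function names are hypothetical, and only the NCBI Gene identifiers (2064 for human ERBB2, 672 for human BRCA1) are real.

```python
from collections import defaultdict

# Hypothetical alias table: each known name maps to one NCBI Gene identifier.
ALIAS_TO_GENE_ID = {"erbb2": "2064", "her2": "2064", "brca1": "672"}

# Hypothetical pre-computed annotations: (PMID, gene mention as written in the abstract).
PRE_ANNOTATIONS = [
    ("1001", "ERBB2"),
    ("1002", "HER2"),
    ("1003", "BRCA1"),
]


def build_concept_index(annotations):
    """Index articles by normalized gene identifier instead of surface string."""
    index = defaultdict(set)
    for pmid, mention in annotations:
        gene_id = ALIAS_TO_GENE_ID.get(mention.lower())
        if gene_id is not None:
            index[gene_id].add(pmid)
    return index


def semantic_search(query, index):
    """Resolve the query to an identifier, then return every article annotated with it."""
    gene_id = ALIAS_TO_GENE_ID.get(query.lower())
    if gene_id is None:
        return []
    return sorted(index.get(gene_id, ()))


index = build_concept_index(PRE_ANNOTATIONS)
print(semantic_search("ERBB2", index))  # includes the article that only mentions HER2
```

Because both the query and the stored mentions pass through the same normalization step, the user never has to list aliases explicitly.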

Figure 2.

The PubTator search results page. Automatically computed entities are highlighted in colours. Unlike PubMed, article abstracts can be displayed here without going to a different page.

Like PubMed, PubTator returns search results in reverse chronological order for all search options except PMID List. However, only 15 results are returned per page in PubTator instead of 20 in PubMed, making it possible for users to glance at the abstract on the search results page, as shown in Figure 2. Unlike PubMed, PubTator highlights pre-computed biological entities in each article when applicable: genes (purple), chemicals (green), diseases (orange), mutations (brown) and species (blue). Because results are shown across all species by default, a search filter (by taxonomy) is provided for biocuration teams who work with a specific organism.
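As an illustration of how a keyword search with 15 results per page in reverse chronological order could be requested through the Entrez Programming Utilities, the sketch below builds an ESearch URL. The text does not state which parameters PubTator actually sends, so the choice of `retstart`, `retmax` and `sort` here is an assumption; the helper name is hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
RESULTS_PER_PAGE = 15  # page size described in the text


def esearch_page_url(query: str, page: int) -> str:
    """Build an ESearch URL for one zero-based page of PubMed results."""
    params = {
        "db": "pubmed",
        "term": query,
        "retstart": page * RESULTS_PER_PAGE,  # index of the first record on this page
        "retmax": RESULTS_PER_PAGE,           # number of records to return
        "sort": "pub_date",                   # newest publications first
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(esearch_page_url("breast cancer", page=0))
```

Fetching the URL returns a list of PMIDs; retrieving the abstracts themselves would need a separate request.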

Annotation function in PubTator

Currently, PubTator supports three annotation tasks: document triage, entity annotation and relationship annotation. In document triage, biocurators select and prioritize curatable articles based on a reading of each article. As a pre-step for full curation, users can readily identify curatable articles through two simple mechanisms in PubTator. First, a user can select articles from the search results by simply checking the box next to the articles (see Figure 2). Second, a user can indicate whether an article is curatable at the top of the annotation page (see Figure 3).

Figure 3.

The PubTator annotation page. The two radio buttons (Curatable/Not Curatable) at the top of the page are designed for document triage. The text box and the table below are used for entity annotation. The relationship table at the bottom of the page is for relationship annotation. In Mention View, each row corresponds to an entity mention. In Concept View (default), different mentions of the same concept (i.e. having the same identifier) are combined and displayed in the same row.

PubTator can be used for annotating bio-entities of any kind by following the steps detailed in our online tutorial page (http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html#DefineBioconcepts). PubTator provides automated pre-annotations for five common types (shown in Table 1). As shown in Figure 3, pre-computed bio-entities are highlighted in colours in the text box and also displayed in the table below, where both mentions and corresponding identifiers are stored. A user can modify or remove an existing annotation as well as insert a new one. To improve efficiency, once a new annotation is made to an entity, there is an option to propagate the annotation throughout the article for the same entity. Once completed, all annotations are saved to our database for download.

Finally, PubTator can be used for annotating relationships between entities (http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/tutorial/index.html#DefineBiorelations). PubTator allows curators to specify the kind of relations they wish to capture from the literature, which can be either between entities of the same kind, such as protein–protein interactions, or between different kinds, such as gene–disease relations. PubTator ensures that the entity types selected by the user are consistent with what is specified by the relationship definition.
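The annotation behaviour described above (propagating an annotation through the article, merging mentions in Concept View, and checking entity types against a relationship definition) can be sketched as follows. This is a simplified illustration and not PubTator's code; the `Mention` structure, the function names, the sample sentence and the disease identifier are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int          # character offset where the mention begins
    end: int            # character offset just past the mention
    text: str
    entity_type: str
    identifier: str


def propagate(article, surface, entity_type, identifier):
    """Annotate every whole-word occurrence of a surface string in the article."""
    pattern = re.compile(r"\b" + re.escape(surface) + r"\b")
    return [Mention(m.start(), m.end(), m.group(), entity_type, identifier)
            for m in pattern.finditer(article)]


def concept_view(mentions):
    """Combine mentions that share an entity type and identifier into one row."""
    rows = {}
    for mention in mentions:
        names = rows.setdefault((mention.entity_type, mention.identifier), [])
        if mention.text not in names:
            names.append(mention.text)
    return rows


def relation_is_consistent(definition, first, second):
    """Check that two mentions have the entity types the relation definition requires."""
    return (first.entity_type, second.entity_type) == definition


article = "HER2 is amplified in breast cancer. ERBB2 encodes HER2."
genes = propagate(article, "HER2", "Gene", "2064") + propagate(article, "ERBB2", "Gene", "2064")
diseases = propagate(article, "breast cancer", "Disease", "DISEASE:0001")  # placeholder identifier

print(concept_view(genes + diseases))
print(relation_is_consistent(("Gene", "Disease"), genes[0], diseases[0]))
```

Mention View corresponds to the flat list of `Mention` objects, while Concept View is the grouped dictionary, so switching between the two views does not require re-annotating the text.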

System adaptability

Rather than building a tool for a specific curation group, we aim to make PubTator adaptable to different curation needs. For instance, PubTator allows its users to define their own entity types and controlled vocabularies for annotating mentions and their corresponding concept identifiers, respectively. This is particularly useful for assigning gene and protein identifiers, where curators from model organism groups may prefer using their own gene nomenclature (e.g. the Arabidopsis Genome Initiative locus identifiers) as opposed to the default NCBI Gene identifiers. However, for user-defined entity types or nomenclature, PubTator does not provide automatic pre-annotations. In addition to PubMed articles, PubTator may also be used to process other types of biomedical text (e.g. annotating grants data). In such cases, the input text can first be uploaded to PubTator according to a specific format and then immediately processed by different text-mining tools on the fly.

EVALUATION RESULTS

PubTator has been formally evaluated through its participation in the interactive text-mining track of the BioCreative 2012 workshop (13). PubTator improved both the efficiency and accuracy of manual curation in user studies for two curation tasks: document triage and gene indexing (23). After the task, the interactive text-mining track organizers conducted a survey to help identify the strengths and weaknesses of different systems with regard to system design, usability and so forth. The survey results show that PubTator received top ratings in many aspects of biocuration, from system design to learnability to usability. Overall, PubTator was the highest rated and most recommended among all participating systems (13).

CONCLUSIONS

There is an increasing need for automatic computer tools to assist many biocuration tasks, including prioritizing articles for full curation and annotating key biological concepts. PubTator was developed in response to these needs. In particular, PubTator provides users with many advanced text-mining tools through an easy-to-use graphical interface that is accessible through the web. Based on the previous user studies, we believe PubTator can provide practical benefits to biocurators in their routine curation work. Future work includes further improvement of existing text-mining algorithms and the integration of additional text-mining tools for better support of ontology concept annotation, which was identified as a critical need in biocuration in recent studies (6,24). We also plan to investigate different search algorithms and full-text processing in future PubTator development.

FUNDING

Intramural Research Program of the NIH, National Library of Medicine. Funding for open access charge: U.S. National Library of Medicine. Conflict of interest statement. None declared.

23 in total

1. Cross-species gene normalization by species inference.

Authors: Chih-Hsuan Wei; Hung-Yu Kao
Journal: BMC Bioinformatics Date: 2011-10-03 Impact factor: 3.169

2. MyMiner: a web application for computer-assisted biocuration and text annotation.

Authors: David Salgado; Martin Krallinger; Marc Depaule; Elodie Drula; Ashish V Tendulkar; Florian Leitner; Alfonso Valencia; Christophe Marcelle
Journal: Bioinformatics Date: 2012-07-12 Impact factor: 6.937

3. tmVar: a text mining approach for extracting sequence variants in biomedical literature.

Authors: Chih-Hsuan Wei; Bethany R Harris; Hung-Yu Kao; Zhiyong Lu
Journal: Bioinformatics Date: 2013-04-05 Impact factor: 6.937

Review 4. A survey on annotation tools for the biomedical literature.

Authors: Mariana Neves; Ulf Leser
Journal: Brief Bioinform Date: 2012-12-18 Impact factor: 11.622

5. Textpresso: an ontology-based information retrieval and extraction system for biological literature.

Authors: Hans-Michael Müller; Eimear E Kenny; Paul W Sternberg
Journal: PLoS Biol Date: 2004-09-21 Impact factor: 8.029

6. Big data: The future of biocuration.

Authors: Doug Howe; Maria Costanzo; Petra Fey; Takashi Gojobori; Linda Hannick; Winston Hide; David P Hill; Renate Kania; Mary Schaeffer; Susan St Pierre; Simon Twigger; Owen White; Seung Yon Rhee
Journal: Nature Date: 2008-09-04 Impact factor: 49.962

7. GeneTUKit: a software for document-level gene normalization.

Authors: Minlie Huang; Jingchen Liu; Xiaoyan Zhu
Journal: Bioinformatics Date: 2011-02-08 Impact factor: 6.937

8. Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE.

Authors: Aurélie Névéol; W John Wilbur; Zhiyong Lu
Journal: Database (Oxford) Date: 2012-06-08 Impact factor: 3.451

9. Using ODIN for a PharmGKB revalidation experiment.

Authors: Fabio Rinaldi; Simon Clematide; Yael Garten; Michelle Whirl-Carrillo; Li Gong; Joan M Hebert; Katrin Sangkuhl; Caroline F Thorn; Teri E Klein; Russ B Altman
Journal: Database (Oxford) Date: 2012-04-23 Impact factor: 3.451

10. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

Authors: Cecilia N Arighi; Ben Carterette; K Bretonnel Cohen; Martin Krallinger; W John Wilbur; Petra Fey; Robert Dodson; Laurel Cooper; Ceri E Van Slyke; Wasila Dahdul; Paula Mabee; Donghui Li; Bethany Harris; Marc Gillespie; Silvia Jimenez; Phoebe Roberts; Lisa Matthews; Kevin Becker; Harold Drabkin; Susan Bello; Luana Licata; Andrew Chatr-aryamontri; Mary L Schaeffer; Julie Park; Melissa Haendel; Kimberly Van Auken; Yuling Li; Juancarlos Chan; Hans-Michael Muller; Hong Cui; James P Balhoff; Johnny Chi-Yang Wu; Zhiyong Lu; Chih-Hsuan Wei; Catalina O Tudor; Kalpana Raja; Suresh Subramani; Jeyakumar Natarajan; Juan Miguel Cejuela; Pratibha Dubey; Cathy Wu
Journal: Database (Oxford) Date: 2013-01-17 Impact factor: 3.451
