NCBI disease corpus: a resource for disease name recognition and concept normalization.

Rezarta Islamaj Doğan, Robert Leaman, Zhiyong Lu.

Abstract

Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. However, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora. This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired, and differing annotations were discussed to reach a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to ensure corpus-wide consistency. The public release of the NCBI disease corpus contains 6892 disease mentions, which are mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. To help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment in which we compared three different knowledge-based disease normalization methods, with a best performance in F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine learning-based approaches for such tasks. The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/. Published by Elsevier Inc.
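
The abstract reports normalization performance as an F-measure but does not say how the score is computed. As a rough illustration only, the sketch below assumes a micro-averaged evaluation in which each document contributes a set of gold concept identifiers and a set of predicted ones. The function name `micro_prf`, the document keys, and the concept identifiers are all hypothetical placeholders, not taken from the paper or the corpus.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over (document, concept) pairs.

    Assumption: scoring is done on per-document sets of concept IDs.
    The paper's actual evaluation protocol may differ.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # concepts predicted and present in gold
        fp += len(p - g)   # concepts predicted but absent from gold
        fn += len(g - p)   # gold concepts that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: document keys and concept IDs are made-up placeholders.
gold = {"doc1": {"C1", "C2"}, "doc2": {"C3"}}
pred = {"doc1": {"C1"}, "doc2": {"C3", "C4"}}
precision, recall, f1 = micro_prf(gold, pred)
```

In this toy case two of the three predicted concepts are correct and two of the three gold concepts are found, so precision, recall and F1 all come out to 2/3. A macro-averaged variant would instead compute the scores per document and then average them.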

Keywords: Corpus annotation; Disease name corpus; Disease name normalization; Disease name recognition; Named entity recognition

Year: 2014 PMID: 24393765 PMCID: PMC3951655 DOI: 10.1016/j.jbi.2013.12.006

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317
