Rezarta Islamaj Doğan, Donald C Comeau, Lana Yeganova, W John Wilbur.
Abstract
BioC is a recently created XML format to share text data and annotations, and an accompanying input/output library to promote interoperability of data and tools for natural language processing of biomedical text. This article reports the use of BioC to address a common challenge in processing biomedical text information: the frequent abbreviation of entity names. We selected three different abbreviation definition identification modules, and used the publicly available BioC code to convert these independent modules into BioC-compatible components that interact seamlessly with BioC-formatted data and other BioC-compatible modules. In addition, we consider four manually annotated corpora of abbreviations in biomedical text: the Ab3P corpus of 1250 PubMed abstracts, the BIOADI corpus of 1201 PubMed abstracts, the old MEDSTRACT corpus of 199 PubMed citations and the Schwartz and Hearst corpus of 1000 PubMed abstracts. Annotations in these corpora have been re-evaluated by four annotators, and their consistency and quality levels have been improved. We converted the corpora to BioC format and described the representation of the annotations. These corpora are used to evaluate the three abbreviation-finding algorithms, and the results are reported. The BioC-compatible modules, when compared with their original form, show no difference in efficiency, running time or any other comparable aspect. They can be conveniently used as a common pre-processing step for larger multi-layered text-mining endeavors.

Database URL: Code and data are available for download at the BioC site: http://bioc.sourceforge.net.

Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
Year: 2014 PMID: 24914232 PMCID: PMC4051513 DOI: 10.1093/database/bau044
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Results of BioC-compliant abbreviation detection modules when tested on BioC-formatted abbreviation corpora
| Module / metric | Ab3P | BIOADI | MEDSTRACT | Schwartz&Hearst |
|---|---|---|---|---|
| Schwartz&Hearst results | | | | |
| Precision | 0.950 | 0.943 | 0.986 | 0.928 |
| Recall | 0.788 | 0.765 | 0.893 | 0.763 |
| F-score | 0.861 | 0.844 | 0.937 | 0.837 |
| Ab3P results | | | | |
| Precision | 0.971 | 0.952 | 0.993 | 0.929 |
| Recall | 0.836 | 0.770 | 0.906 | 0.770 |
| F-score | 0.898 | 0.851 | 0.947 | 0.842 |
| NatLAb results | | | | |
| Precision | 0.927 | 0.885 | 0.924 | 0.856 |
| Recall | 0.879 | 0.833 | 0.918 | 0.824 |
| F-score | 0.903 | 0.858 | 0.921 | 0.840 |
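
The F-scores in the table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The following is a minimal sketch of how such figures could be computed, assuming that predictions and gold annotations are compared as exact (short form, long form) pairs; the matching criterion actually used in the evaluation is not stated here, and the names `f_score` and `evaluate` are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (short form, long form); exact-match pairing is an assumption


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold: Set[Pair], predicted: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for predicted pairs against gold pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Ab3P module on the Ab3P corpus: precision 0.971, recall 0.836
print(round(f_score(0.971, 0.836), 3))  # 0.898
```

Recomputing an F-score from the rounded precision and recall shown in the table can differ from the reported value in the third decimal place, because the reported values were presumably derived from unrounded counts.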
Figure 1. Illustration of abbreviation annotation in BioC format.
Figure 2. A graphical representation of abbreviation annotations in BioC format. The excerpt from one of the corpus documents contains multi-segmented abbreviation long forms. The traditional ShortForm, LongForm pairing is shown in the figure, as well as the infons detailing BioC annotations for an abbreviation, and the relation between them, with the corresponding precise text offsets.
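
To make the representation described in these captions concrete, here is a hedged sketch that reads ShortForm/LongForm pairs from a BioC-style XML passage using only the Python standard library. The sample document, the annotation ids, the `ABBR` relation type and the exact infon values are assumptions made for illustration and may not match the released corpora; the function `abbreviation_pairs` is hypothetical and is not part of the BioC library. Joining the segments of a multi-segment long form with a space is likewise an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC-style passage; ids and infon values are illustrative assumptions.
SAMPLE = """<collection>
  <source>hypothetical</source><date></date><key></key>
  <document>
    <id>0</id>
    <passage>
      <offset>0</offset>
      <text>We used magnetic resonance imaging (MRI) in all patients.</text>
      <annotation id="SF0">
        <infon key="type">ShortForm</infon>
        <location offset="36" length="3"/>
        <text>MRI</text>
      </annotation>
      <annotation id="LF0">
        <infon key="type">LongForm</infon>
        <location offset="8" length="26"/>
        <text>magnetic resonance imaging</text>
      </annotation>
      <relation id="R0">
        <infon key="type">ABBR</infon>
        <node refid="LF0" role="LongForm"/>
        <node refid="SF0" role="ShortForm"/>
      </relation>
    </passage>
  </document>
</collection>"""


def abbreviation_pairs(xml_text):
    """Return (short form, long form) pairs recovered from location offsets."""
    root = ET.fromstring(xml_text)
    pairs = []
    for passage in root.iter("passage"):
        base = int(passage.findtext("offset", default="0"))
        text = passage.findtext("text", default="")
        segments = {}
        for ann in passage.findall("annotation"):
            # An annotation may carry several locations (multi-segment long forms).
            parts = []
            for loc in ann.findall("location"):
                start = int(loc.get("offset")) - base
                parts.append(text[start:start + int(loc.get("length"))])
            segments[ann.get("id")] = parts
        for rel in passage.findall("relation"):
            roles = {n.get("role"): n.get("refid") for n in rel.findall("node")}
            short = " ".join(segments[roles["ShortForm"]])
            long_form = " ".join(segments[roles["LongForm"]])
            pairs.append((short, long_form))
    return pairs


print(abbreviation_pairs(SAMPLE))  # [('MRI', 'magnetic resonance imaging')]
```

The sketch slices the passage text by offset rather than trusting each annotation's `<text>` element, which doubles as a check that the recorded offsets are consistent with the passage.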
Characteristics of abbreviation definition corpora in biomedical literature
| Characteristic | Ab3P | BIOADI | MEDSTRACT | Schwartz&Hearst |
|---|---|---|---|---|
| Number of abstracts | 1250 | 1201 | 199 | 1000 |
| Number of abbreviation definitions | 1223 | 1720 | 159 | 979 |
| Number of unique abbreviation definitions (across the whole corpus) | 1113 | 1421 | 152 | 842 |
| Number of unique abbreviations (ShortForms) | 998 | 1330 | 146 | 724 |
Overlap between corpora, measured as the number of documents each pair has in common
| Corpora | Ab3P | BIOADI | MEDSTRACT | Schwartz&Hearst |
|---|---|---|---|---|
| Ab3P | 1250 | 0 | 0 | 1 |
| BIOADI | | 1201 | 0 | 6 |
| MEDSTRACT | | | 199 | 2 |
| Schwartz&Hearst | | | | 1000 |
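
An upper-triangular overlap table of this kind can be produced by intersecting the sets of document identifiers in each pair of corpora. The sketch below assumes each corpus is available as a set of PMID strings; the identifiers shown are made up for illustration, and `overlap_table` is a hypothetical helper.

```python
from itertools import combinations_with_replacement


def overlap_table(corpora):
    """Map each unordered corpus pair (including a corpus with itself)
    to the number of document IDs the two corpora share."""
    names = list(corpora)
    return {
        (a, b): len(corpora[a] & corpora[b])
        for a, b in combinations_with_replacement(names, 2)
    }


# Made-up document IDs, for illustration only.
toy = {
    "A": {"1", "2", "3"},
    "B": {"3", "4"},
    "C": {"5"},
}
for pair, shared in overlap_table(toy).items():
    print(pair, shared)
```

The diagonal entries equal the corpus sizes, which is why the diagonal of the table above repeats the number of abstracts in each corpus.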
Figure 3. Word cloud (http://www.wordle.net/create) representations of MeSH terms found in each corpus: Ab3P (top left), BIOADI (top right), MEDSTRACT (bottom left) and Schwartz and Hearst (bottom right). The MeSH terms confirm each corpus's original intent: Ab3P was intended as a representation of all biomedical literature in PubMed; BIOADI is the corpus used in the BioCreative II gene normalization challenge; half of the MEDSTRACT documents resulted from the search term ‘gene’ on MEDLINE, restricted to a small group of biomedical journals; and Schwartz and Hearst was a selection of documents returned for the search term ‘yeast’ applied to PubMed.