Sunghwan Sohn, Donald C Comeau, Won Kim, W John Wilbur.
Abstract
BACKGROUND: The rapid growth of biomedical literature presents challenges for automatic text processing, one of which is abbreviation identification. Unrecognized abbreviations in text hinder indexing algorithms and adversely affect information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Because databases such as MEDLINE are so large, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is therefore needed. In this paper, we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies when seeking the definition of an abbreviation.
Year: 2008 PMID: 18817555 PMCID: PMC2576267 DOI: 10.1186/1471-2105-9-402
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Basic rules used in strategies
| Word- | |
| Word | |
| W | |
| Word | |
| Stopword | |
| Word | |
a Words are white space demarcated strings. This applies to all rules.
b A defined word is at least three letters, a non-stopword, and appears at least 100 times in MEDLINE.
Strategy description
| FirstLet: FL for all letters in SF | |
| FirstLetOneChSF: Applied for 1-letter SF. | |
| FirstLetGen: FC or FCG, at least one FCG | 1- |
| FirstLetGen2: FC or FCG | |
| FirstLetGenS: SF consists of upper-case letters and lower-case letter 's' at the end. | |
| FirstLetGenStp: FC or FCG or ST, at least one ST (at most one ST between matched words or at end) | |
| FirstLetGenStp2: FC or FCG or ST, at least one pair of adjacent ST (at most two ST between matched words or at end) | |
| FirstLetGenSkp: FC or FCG or SK, at least one SK (at most one SK between matched words or at end) | |
| WithinWrdFWrd: FC or FCG or SBW, at least one SBW, all SBW in a FC or FCG matched word in LF | |
| WithinWrdWrd: FC or FCG or SBW, at least one SBW | |
| WithinWrdFWrdSkp: WithinWrdFWrd or SK, at least one SK (at most one SK between matched words or at end) | |
| WithinWrdFLet: FC or FCG or NF, at least one NF, all NF in a FC or FCG matched word in LF | |
| WithinWrdLet: FC or FCG or NF, at least one NF | |
| WithinWrdFLetSkp: WithinWrdFLet or SK, at least one SK (at most one SK between matched words or at end) | |
| ContLet: FC or FCG or CL, at least one CL, all CL in a FC or FCG matched word in LF | |
| ContLetSkp: ContLet or SK, at least one SK (at most one SK between matched words or at end) | |
| AnyLet: The 1st character of SF: FC or FCG. The others: AC or SK (at most one SK between matched words or at end) | |
a The SF and LF pair must appear in MEDLINE at least twice and LF must not be a stopword.
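To make the simplest strategy concrete, the following is a minimal sketch of a FirstLet-style check. It assumes that FL means a short-form (SF) character matches the first letter of a long-form (LF) word, that the matched words are consecutive and cover the whole LF, and that matching is case-insensitive. None of these details are confirmed by the table above, and the function name `first_let_match` is hypothetical. The MEDLINE frequency condition in the footnote is not modeled.

```python
def first_let_match(short_form: str, long_form: str) -> bool:
    """Return True if every SF character equals the first letter of the
    corresponding LF word (assumed reading of the FirstLet strategy)."""
    # Words are white space demarcated strings, as in the rules table.
    words = long_form.split()
    # Assumption: one LF word per SF character, in order.
    if len(words) != len(short_form):
        return False
    return all(w[0].lower() == c.lower() for c, w in zip(short_form, words))


print(first_let_match("IBV", "infectious bronchitis virus"))    # True
print(first_let_match("CZE", "capillary zone electrophoresis"))  # True
print(first_let_match("LCM", "Lymphocytic choriomeningitis"))    # False
```

The last pair fails here because its letters fall inside words rather than at word starts, which is the kind of case the later, more permissive strategies are described as covering.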
Figure 1. Strategy ordering.
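The abstract states that pseudo-precisions determine the order in which strategies are applied. A minimal sketch of that idea follows: strategies are tried from highest to lowest pseudo-precision, and the first one that matches is reported. The two matchers, their names, and the pseudo-precision values are illustrative assumptions, not the paper's strategies or estimates.

```python
from typing import Callable, List, Optional, Tuple

Matcher = Callable[[str, str], bool]


def first_letters(sf: str, lf: str) -> bool:
    """Toy matcher: each SF character is the first letter of one LF word."""
    words = lf.split()
    return len(words) == len(sf) and all(
        w[0].lower() == c.lower() for c, w in zip(sf, words)
    )


def in_order_letters(sf: str, lf: str) -> bool:
    """Toy fallback: SF characters occur somewhere in LF, in order."""
    remaining = iter(lf.lower())
    return all(c in remaining for c in sf.lower())


def identify(
    sf: str, lf: str, strategies: List[Tuple[str, float, Matcher]]
) -> Optional[Tuple[str, float]]:
    """Apply strategies in descending pseudo-precision; return the first hit."""
    for name, score, matcher in sorted(strategies, key=lambda s: s[1], reverse=True):
        if matcher(sf, lf):
            return name, score
    return None


# Hypothetical pseudo-precisions, chosen only to fix an ordering.
STRATEGIES = [
    ("in_order_letters", 0.75, in_order_letters),
    ("first_letters", 0.99, first_letters),
]

print(identify("IBV", "infectious bronchitis virus", STRATEGIES))
# ('first_letters', 0.99)
print(identify("SKY", "spectral karyotyping", STRATEGIES))
# ('in_order_letters', 0.75)
```

Because the most reliable strategy is tried first, a pair that several strategies could match is credited to the one with the highest estimated accuracy.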
Difference between gold standard and annotators in 250 MEDLINE records.
| Annotator | Precision (%) | Recall (%) | F-measure^a (%) |
| Annotator 1 | 93.4 | 89.5 | 91.4 |
| Annotator 2 | 95.6 | 91.6 | 93.5 |
| Annotator 3 | 93.4 | 89.5 | 91.4 |
| Annotator 4 | 89.7 | 73.4 | 80.7 |
| Annotator 1 & 2 | 96.1 | 92.8 | 94.4 |
| Annotator 3 & 4 | 98.7 | 96.2 | 97.4 |
a F-measure = 2*(Precision*Recall)/(Precision+Recall)
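As a quick check of the footnote formula, the sketch below recomputes the F-measure for a few rows of the table from their reported precision and recall. The function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * (precision * recall) / (precision + recall)


# (precision %, recall %) as reported in the table above
rows = {
    "Annotator 1": (93.4, 89.5),
    "Annotator 4": (89.7, 73.4),
    "Annotator 3 & 4": (98.7, 96.2),
}
for name, (p, r) in rows.items():
    print(f"{name}: {f_measure(p, r):.1f}")
# Annotator 1: 91.4
# Annotator 4: 80.7
# Annotator 3 & 4: 97.4
```

Values recomputed this way start from already rounded precision and recall, so they can differ from a reported F-measure in the last digit.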
Correct SF and LF pairs identified by our algorithm.
| SF | LF | P-precision | Strategy Used |
| IBV | infectious bronchitis virus | 0.9998 | FirstLet |
| CZE | capillary zone electrophoresis | 0.9998 | FirstLet |
| PMECs | pulmonary microvascular endothelial cells | 0.9999 | FirstLetGenS |
| LCM | Lymphocytic choriomeningitis | 0.9978 | WithinWrdFWrd |
| ICG | impedance cardiogram | 0.9978 | WithinWrdFWrd |
| D-Gal | D-Galactosamine | 0.9946 | ContLet |
| Prl | prolactin | 0.9877 | ContLet |
| P | progesterone | 0.9672 | FirstLetOneChSF |
| T | Testosterone | 0.9672 | FirstLetOneChSF |
| SKY | spectral karyotyping | 0.9813 | WithinWrdFLet |
| GG | genioglossus | 0.9420 | WithinWrdFLet |
| TEV | tobacco etch potyvirus | 0.9437 | WithinWrdWrd |
| PDX1 | pancreatic duodenal homeobox factor-1 | 0.9863 | WithinWrdLet |
| GC/ECD | gas chromatography employing an electron capture detector | 0.7456 | AnyLet |
Figure 2. Precision-recall curve with P-precision threshold on 1250 MEDLINE records. Some values of P-precision are labelled on the curve.
Figure 3. Evaluation of strategies on 1250 MEDLINE records. Average P-precision (dark bars) is the mean of the P-precisions of the SF-LF pairs identified by a strategy. Precision (light bars) is the number of gold standard pairs identified by a strategy divided by the total number of pairs identified by that strategy. Error bars denote the 95% confidence interval of Precision; long error bars correspond to a small sample size for that strategy.