Naoaki Okazaki, Sophia Ananiadou, Jun'ichi Tsujii.
Abstract
MOTIVATION: The ultimate goal of abbreviation management is to disambiguate every occurrence of an abbreviation into its expanded form (concept or sense). To collect expanded forms for abbreviations, previous studies have recognized abbreviations and their expanded forms in parenthetical expressions of biomedical texts. However, expanded forms extracted by abbreviation recognition are mixtures of concepts/senses and their term variations. Consequently, a list of expanded forms should be structured into a sense inventory, which provides possible concepts or senses for abbreviation disambiguation.
Year: 2010 PMID: 20360059 PMCID: PMC2859134 DOI: 10.1093/bioinformatics/btq129
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Term variation and ambiguity of the abbreviation PCR.
Fig. 2. Workflow of the proposed system.
Features for the string similarity measure
| # | Feature | Type | Description | Example | Weight |
|---|---|---|---|---|---|
| 1 | Character | Real | Cosine similarity of letter n-grams | (0.954, 0.953, 0.951) | (1.037, 3.838, 9.043) |
| 2 | Normalized Levenshtein distance | Real | The minimum number of insertions, deletions and substitutions necessary to transform one term into the other (Levenshtein) | 0.061 | 2.742 |
| 3 | Jaro–Winkler similarity (Winkler) | Real | This metric considers the number of shared letters and transpositions between two terms; it also incorporates a formula that favors two terms matching from the beginning. | 0.979 | −0.536 |
| 4 | Word | Real | Cosine similarity of word n-grams | (0.750, 0.667, 0.500) | (0.457, −2.439, 0.523) |
| 5 | SoftTFIDF (Cohen) | Real | This metric aligns tokens between two strings using the Jaro–Winkler similarity with threshold 0.9, and computes the sum of the similarity scores of aligned pairs; the similarity score is based on TFIDF scores. | 1.883 | 0.946 |
| 6 | Bias | Real | This feature always yields 1. | 1 | −9.340 |
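Two of the features in the table above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and dividing the Levenshtein distance by the length of the longer term is an assumption, since the table does not state the normalization.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count the letter n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(s: str, t: str) -> float:
    """Edit distance scaled to [0, 1]."""
    # Assumption: normalize by the length of the longer term.
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0
```

For a pair of term variants such as `"polymerase chain reaction"` and `"polymerase chain reactions"`, calling `cosine_similarity(char_ngrams(a, n), char_ngrams(b, n))` for several values of `n` yields one similarity per n-gram size, which would match the tuple-valued example in row 1 if the three values correspond to three n-gram sizes.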
Rules to generate features for classifiers
| Feature type | Unit | Effective region (window) |
|---|---|---|
| Neighbor context | uni | Previous and next words to the abbreviation |
| Local context | uni, bi | Three words previous and next to the abbreviation |
| Sentence context | uni, bi | Words in the same sentence for |
| Abstract context | uni, bi | Words in the same abstract for |
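The windowing rules in the table above can be sketched as follows. This is an illustrative reading only: the feature-name prefixes, the function names and the choice to build n-grams separately on each side of the abbreviation are assumptions, not details given in the source. The abstract context would follow the same pattern as the sentence context, applied to the tokens of the whole abstract.

```python
def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_features(tokens, index, window=3):
    """Build context features for the abbreviation at tokens[index]."""
    features = set()

    # Neighbor context: unigrams directly before and after the abbreviation.
    if index > 0:
        features.add(f"neighbor:prev={tokens[index - 1]}")
    if index + 1 < len(tokens):
        features.add(f"neighbor:next={tokens[index + 1]}")

    # Local context: uni- and bigrams within `window` words on each side.
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    for side in (left, right):
        for n in (1, 2):
            for gram in ngrams(side, n):
                features.add(f"local:{n}={gram}")

    # Sentence context: uni- and bigrams over the rest of the sentence.
    # Assumption: the abbreviation itself is excluded from the n-grams.
    for side in (tokens[:index], tokens[index + 1:]):
        for n in (1, 2):
            for gram in ngrams(side, n):
                features.add(f"sentence:{n}={gram}")

    return features
```

A call such as `context_features("amplified by PCR using specific primers".split(), 2)` returns a set of string features that a sparse linear classifier could consume directly.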
Fig. 3. Excerpt of the clusters of expanded forms.
Feature contributions for the similarity metric
| Features | A | P | R | F1 | ΔF1 |
|---|---|---|---|---|---|
| Sim (ch) | 0.963 | 0.879 | 0.895 | 0.887 | |
| Sim (wd) | 0.937 | 0.844 | 0.747 | 0.793 | |
| Sim (ch + wd) | 0.962 | 0.877 | 0.890 | 0.884 | |
| Levenshtein | 0.939 | 0.849 | 0.754 | 0.799 | |
| Jaro–Winkler | 0.918 | 0.920 | 0.534 | 0.676 | |
| SoftTFIDF | 0.921 | 0.817 | 0.656 | 0.728 | |
| Full | 0.965 | 0.883 | 0.900 | 0.892 | |
| - Sim (ch) | 0.947 | 0.855 | 0.808 | 0.831 | −0.061 |
| - Sim (wd) | 0.965 | 0.885 | 0.898 | 0.891 | −0.001 |
| - Sim (ch+wd) | 0.950 | 0.868 | 0.810 | 0.838 | −0.054 |
| - Levenshtein | 0.965 | 0.882 | 0.898 | 0.890 | −0.002 |
| - Jaro–Winkler | 0.965 | 0.882 | 0.899 | 0.891 | −0.001 |
| - SoftTFIDF | 0.965 | 0.882 | 0.901 | 0.892 | −0.000 |
ΔF1, the difference in F1 score from the Full feature set; Sim (ch), character n-gram similarity; Sim (wd), word n-gram similarity.
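As a quick check on the ablation rows, the ΔF1 column can be recomputed from the F1 column. The sketch below takes 0.892 as the F1 of the Full feature set (the value from which the tabulated differences are consistent) and uses the F1 values copied from the table; the variable and function names are hypothetical.

```python
# F1 of the Full feature set, as read from the table.
FULL_F1 = 0.892

# F1 after removing one feature (or feature group), copied from the table.
ablated_f1 = {
    "Sim (ch)": 0.831,
    "Sim (wd)": 0.891,
    "Sim (ch+wd)": 0.838,
    "Levenshtein": 0.890,
    "Jaro-Winkler": 0.891,
    "SoftTFIDF": 0.892,
}


def delta_f1(f1: float, full: float = FULL_F1) -> float:
    """Difference from the Full feature set, rounded to three decimals."""
    return round(f1 - full, 3)


deltas = {name: delta_f1(f1) for name, f1 in ablated_f1.items()}
```

The largest drops come from removing the character n-gram similarity, alone or together with the word n-gram similarity; removing any other single feature changes F1 by at most 0.002.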
Fig. 4. Performance of clustering with different algorithms.
Number of database records including names conflicting with abbreviations with at least k senses.
| k | UniProt | UMLS genes | UMLS acids |
|---|---|---|---|
| ≥0 | 466 739 (100) | 29 194 (100) | 116 011 (100) |
| ≥1 | 149 537 (32.0) | 7525 (25.8) | 17 854 (15.4) |
| ≥2 | 77 833 (16.7) | 3852 (13.2) | 7424 (6.4) |
| ≥3 | 56 430 (12.1) | 2982 (10.2) | 5277 (4.5) |
| … | … | … | … |
| ≥30 | 4841 (1.0) | 426 (1.5) | 507 (0.4) |
Performance of abbreviation disambiguation
| Features | A | P | R | F1 |
|---|---|---|---|---|
| Majority | 0.789 | 0.621 | 0.663 | 0.636 |
| Majority (w/o clustering) | 0.760 | 0.571 | 0.619 | 0.588 |
| Proposed | 0.984 | 0.992 | 0.984 | 0.986 |
| Proposed (w/o clustering) | 0.801 | 0.854 | 0.831 | 0.830 |
| +Neighbor | 0.925 | 0.961 | 0.929 | 0.934 |
| +Local | 0.952 | 0.980 | 0.955 | 0.961 |
| +Sentence | 0.967 | 0.987 | 0.967 | 0.973 |
| +Abstract | 0.982 | 0.992 | 0.983 | 0.986 |
| - Abstract | 0.968 | 0.988 | 0.968 | 0.974 |
| - Abstract - Neighbor | 0.968 | 0.988 | 0.968 | 0.974 |
| - Abstract - Local | 0.968 | 0.987 | 0.968 | 0.973 |
| - Abstract - Sentence | 0.953 | 0.980 | 0.956 | 0.962 |
Performance of disambiguating the 400 abbreviations
| Clustering | Evaluation | A | P | R | F1 |
|---|---|---|---|---|---|
| Gold-standard | Gold-standard | 0.992 | 0.989 | 0.979 | 0.982 |
| Automatic | Automatic | 0.993 | 0.991 | 0.980 | 0.983 |
| Automatic | Gold-standard | 0.993 | 0.991 | 0.978 | 0.982 |
| No | Gold-standard | 0.984 | 0.980 | 0.963 | 0.968 |