Kristina M Hettne, Erik M van Mulligen, Martijn J Schuemie, Bob JA Schijvenaars, Jan A Kors.
Abstract
BACKGROUND: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining, we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. For every rule, the 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated.
Year: 2010 PMID: 20618981 PMCID: PMC2895736 DOI: 10.1186/2041-1480-1-5
Source DB: PubMed Journal: J Biomed Semantics
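To make the distinction between rewrite and suppression rules concrete, the following is a minimal Python sketch of how such rules could be applied to a list of thesaurus terms. The source does not give the rule definitions, so this is not the authors' implementation. The function names, the regular expression, and the concrete behaviour of each rule are assumptions inferred only from rule names such as "Syntactic inversion", "Possessives", "At-sign" and "Words > 5". Applying suppression before rewriting is also an assumption.

```python
import re

# Hypothetical rewrite rules: each returns a new term, or None if it does not apply.

def syntactic_inversion(term):
    """Assumed behaviour: 'Failure, Renal' -> 'Renal Failure' (exactly one ', ')."""
    parts = term.split(", ")
    if len(parts) == 2 and all(parts):
        return f"{parts[1]} {parts[0]}"
    return None

def strip_possessive(term):
    """Assumed behaviour: "Alzheimer's disease" -> 'Alzheimer disease'."""
    rewritten = re.sub(r"'s\b", "", term)
    return rewritten if rewritten != term else None

# Hypothetical suppression rules: each returns True if the term should be dropped.

def has_more_than_five_words(term):
    return len(term.split()) > 5

def contains_at_sign(term):
    return "@" in term

REWRITE_RULES = [syntactic_inversion, strip_possessive]
SUPPRESSION_RULES = [has_more_than_five_words, contains_at_sign]

def process(terms):
    """Suppress first, then rewrite the surviving terms (ordering is an assumption)."""
    kept = [t for t in terms if not any(rule(t) for rule in SUPPRESSION_RULES)]
    generated = []
    for term in kept:
        for rule in REWRITE_RULES:
            new_term = rule(term)
            if new_term is not None and new_term not in kept and new_term not in generated:
                generated.append(new_term)
    return kept, generated

kept, generated = process(
    ["Failure, Renal", "Alzheimer's disease", "a b c d e f", "x@y"]
)
```

In this toy run, the six-word term and the term containing "@" are suppressed, and the two remaining terms each yield one new term.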
New terms generated by the rewrite rules and terms suppressed by the suppression rules.
| Rule | Terms in thesaurus |
|---|---|
| Original | 2,696,820 |
| Syntactic inversion | 231,976 |
| Possessives | 10,388 |
| Short/long form | 288 |
| Angular brackets | 2,824 |
| Semantic type | 7,231 |
| Begin parentheses | 376 |
| End parentheses | 45,265 |
| Begin brackets | 11,402 |
| End brackets | 17,620 |
| Dosages | 171,369 |
| Short token | 2,044 |
| At-sign | 123 |
| EC numbers | 161 |
| Any classification | 5,299 |
| Any underspecification | 40,237 |
| Miscellaneous | 37,885 |
| Words > 5 | 653,128 |
"Terms in thesaurus" indicates the number of new terms generated by the rewrite rules and the number of terms suppressed by the suppression rules, for every rule. The row "Original" indicates the total number of terms in the thesaurus when no rewrite or suppression rule was applied.
Number of homonyms (%) generated for every rewrite rule.
| Rewrite rule | No of homonyms (%) |
|---|---|
| Syntactic inversion | 303 (0.1) |
| Possessives | 40 (0.4) |
| Short/long form | 321 (52.7) |
| Angular brackets | 218 (7.2) |
| Semantic type | 130 (1.8) |
| Begin parentheses | 28 (6.9) |
| End parentheses | 5,505 (10.8) |
| Begin brackets | 249 (2.1) |
| End brackets | 37,083 (67.8) |
The percentage is relative to the total number of rewritten terms for every rule.
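As an illustration of how a homonym percentage of this kind could be computed, here is a small Python sketch on toy data. It assumes that a homonym is a rewritten term whose string ends up attached to more than one concept, and that matching is case-insensitive. The data structures, the concept identifiers, and the function name `homonym_percentage` are hypothetical and are not taken from the source.

```python
from collections import defaultdict

def homonym_percentage(rewritten, thesaurus):
    """Count rewritten terms that map to more than one concept.

    rewritten: list of (new_term, concept_id) pairs produced by one rewrite rule.
    thesaurus: dict mapping an existing term to the set of its concept ids.
    Returns (number_of_homonyms, percentage_of_rewritten_terms).
    """
    concepts_by_term = defaultdict(set)
    for term, concepts in thesaurus.items():
        concepts_by_term[term.lower()] |= set(concepts)
    for term, concept_id in rewritten:
        concepts_by_term[term.lower()].add(concept_id)

    homonyms = sum(
        1 for term, _ in rewritten if len(concepts_by_term[term.lower()]) > 1
    )
    percentage = 100.0 * homonyms / len(rewritten) if rewritten else 0.0
    return homonyms, percentage

# Toy data: only "Cold" collides with an existing term of a different concept.
toy_thesaurus = {"cold": {"C1"}, "renal failure": {"C2"}}
toy_rewritten = [
    ("Cold", "C3"),
    ("Renal Failure", "C2"),
    ("Heart Attack", "C4"),
    ("Liver Disease", "C5"),
]
count, pct = homonym_percentage(toy_rewritten, toy_thesaurus)
```

The denominator is the number of rewritten terms for the rule, following the note under the table.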
Rewritten or suppressed terms and concepts found in the corpus.
| Rule | Terms in corpus (all) | Terms in corpus (distinct) | Concepts in corpus (distinct) |
|---|---|---|---|
| Original | 3,992,662,340 | 651,268 | 397,414 |
| Syntactic inversion | 529,058 | 12,433 | 11,291 |
| Possessives | 34,211 | 1,134 | 946 |
| Short/long form | 305,541 | 216 | 182 |
| Angular brackets | 30,124 | 743 | 731 |
| Semantic type | 218,838 | 259 | 259 |
| Begin parentheses | 523 | 26 | 25 |
| End parentheses | 8,916,764 | 4,776 | 4,494 |
| Begin brackets | 176,791 | 274 | 251 |
| End brackets | 65,873 | 241 | 236 |
| Dosages | 109,246 | 5,014 | 4,885 |
| Short token | 1,906,901,846 | 1,009 | 945 |
| At-sign | 0 | 0 | 0 |
| EC numbers | 45,138 | 149 | 146 |
| Any classification | 6,972 | 42 | 36 |
| Any underspecification | 9,470 | 322 | 290 |
| Miscellaneous | 91,576,083 | 1,257 | 1,095 |
| Words > 5 | 179,051 | 5,734 | 4,665 |
"Terms in corpus (all)" indicates the number of occurrences of the new terms generated by the rewrite rules and the terms suppressed by the suppression rules in the corpus. "Terms in corpus (distinct)" and "Concepts in corpus (distinct)" indicate the number of unique terms and concepts produced or suppressed by the rules that were found in the corpus. The row "Original" indicates the total number of terms found in the corpus when no rewrite or suppression rule was applied.
Number of correct and incorrect terms for each of the rewrite and suppression rules.
| Rule | Most frequent: correct | Most frequent: incorrect | Random: correct | Random: incorrect |
|---|---|---|---|---|
| Syntactic inversion | 50 | 0 | 100 | 0 |
| Possessives | 50 | 0 | 100 | 0 |
| Short/long form | 49 | 1 | 98 | 2 |
| Angular brackets | 50 | 0 | 97 | 3 |
| Semantic type | 50 | 0 | 100 | 0 |
| Begin parentheses | 1 | 25 | - | - |
| End parentheses | 49 | 1 | 96 | 4 |
| Begin brackets | 38 | 12 | 91 | 9 |
| End brackets | 46 | 4 | 95 | 5 |
| Dosages | 50 | 0 | 100 | 0 |
| Short token | 50 | 0 | 100 | 0 |
| At-sign | - | - | - | - |
| EC numbers | 50 | 0 | 99 | 0 |
| Any classification | 50 | 0 | 100 | 0 |
| Any underspecification | 50 | 0 | 100 | 0 |
| Miscellaneous | 50 | 0 | 100 | 0 |
| Words > 5 | 0 | 50 | 5 | 95 |
For every rule, the calculations are based on the 50 most frequently found terms in the corpus and 100 randomly selected terms in the corpus (if available). The At-sign rule has no values because terms suppressed by this rule were not found in the corpus.
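The counts in this table can be turned into proportions with simple arithmetic. The Python sketch below does this for a few rows copied from the table; the helper names `share_correct` and `summarize` are hypothetical, and a "-" cell is represented as `None`.

```python
# Counts copied from the table above, in column order:
# (most frequent correct, most frequent incorrect, random correct, random incorrect).
# None stands for a "-" cell, where no sample was available.
EVALUATION = {
    "Begin parentheses": (1, 25, None, None),
    "Begin brackets": (38, 12, 91, 9),
    "End brackets": (46, 4, 95, 5),
    "Words > 5": (0, 50, 5, 95),
}

def share_correct(correct, incorrect):
    """Return correct / (correct + incorrect), or None if the sample is missing."""
    if correct is None or incorrect is None or correct + incorrect == 0:
        return None
    return correct / (correct + incorrect)

def summarize(evaluation):
    """Map each rule to (share correct in most-frequent set, share correct in random set)."""
    summary = {}
    for rule, (freq_ok, freq_bad, rand_ok, rand_bad) in evaluation.items():
        summary[rule] = (
            share_correct(freq_ok, freq_bad),
            share_correct(rand_ok, rand_bad),
        )
    return summary

summary = summarize(EVALUATION)
```

For example, the "Begin brackets" row gives 38 of 50 correct among the most frequent terms and 91 of 100 correct in the random sample.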