Alex Rudniy, Min Song, James Geller.
Abstract
BACKGROUND: The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation.
Year: 2014 PMID: 24928653 PMCID: PMC4086698 DOI: 10.1186/1471-2105-15-187
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Four medical informatics datasets used in experiments
| The UMLS most frequent concepts from multiple sources | 100 | 4,979 | 369 |
| The SNOMED CT most frequent concepts | 155 | 5,000 | 281 | |
| The UMLS concepts with longest terms (“longest concepts”) | 3,337 | 5,000 | 1,693 | |
| The SNOMED CT longest concepts | 1,805 | 5,000 | 903 |
Figure 1. Example of histogram intersection. The expression hist(S1..) ∩ hist(T1..) denotes the histogram intersection of two string prefixes. The figure depicts the histogram intersection of two UMLS terms, ammonium and ammonium ion: the histogram of ammonium is in a, and the histogram of ammonium ion is in b. The intersection (c) is computed as the minimum for each pair of argument values of the same character, with characters missing from either argument omitted from the result. For example, ammonium contains one “o” while there are two letters “o” in ammonium ion. As min(1, 2) = 1, the resulting histogram in c contains the entry “1” for the letter “o.” As there is no blank in ammonium, there is also no entry for the blank character in the resulting histogram. To compute the size (the “absolute value” ||) of the histogram intersection in c, all the numbers in the resulting histogram are summed. For c, the size of the histogram intersection is 1 + 1 + 3 + 1 + 1 + 1 = 8.
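The histogram intersection described in the caption can be reproduced in a few lines. The following is a minimal sketch, assuming character histograms are represented as Python `Counter` objects; the function names `histogram_intersection` and `intersection_size` are illustrative and do not come from the paper.

```python
from collections import Counter


def histogram_intersection(s: str, t: str) -> Counter:
    """Per-character minimum of the two character histograms.

    Counter's `&` operator keeps min(count_s, count_t) for each character
    and drops characters that are absent from either string.
    """
    return Counter(s) & Counter(t)


def intersection_size(s: str, t: str) -> int:
    """Size of the intersection: the sum of all counts in the result."""
    return sum(histogram_intersection(s, t).values())


if __name__ == "__main__":
    common = histogram_intersection("ammonium", "ammonium ion")
    print(dict(common))
    print(intersection_size("ammonium", "ammonium ion"))
```

For the two terms in the figure, the intersection holds one entry each for "a", "o", "n", "i" and "u", three for "m", and none for the blank, which gives the size of 8 stated in the caption.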
UMLS terms sharing the same longest approximately common prefix
| No. | Term | Length (characters) |
| 1 | Ammonium | 8 |
| 2 | Ammonium ion | 12 |
| 3 | AMMONIUM-CHLORIDE 1 MG/CYANOCOBALAMIN 5 MCG/FERRIC AMMON IUM CITRATE 40 MG/FOLIC ACID 1 MG/LYSINE HYDROCHLORIDE 100 MG/MAGNESIUM SULFATE 1 MG/MANGANESE SULFATE ANHYDROUS 1 MG/NIACIN 5 MG/PANTHENOL 1 MG/POTASSIUM SULFATE 1 MG/PYRIDOXINE HYDROCHLORIDE 0.5 MG/RIBOFLAVIN 1.2 MG/THIAMINE HYDROCHLORIDE 12 MG/ZINC SULFATE 1 MG ORAL LIQUID [HEMERGON] | 369 |
Algorithm of the LACP method
| 1 | O(1) | |
| 2 | FOR | O(n) |
| BEGIN | ||
| 3 | | O(1) |
| 4 | | O(1) |
| 5 | FOR (Char | Constant |
| BEGIN | ||
| 6 | IF | O(1) |
| 7 | THEN | O(1) |
| | ||
| 8 | END | |
| 9 | IF ( | O(1) |
| THEN RETURN | ||
| 10 | END | |
| 11 | RETURN | O(1) |
| Total complexity | O(n) | |
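The statements of the algorithm did not survive in the table above; only its control structure (an outer loop over string positions, a constant-size inner loop over characters, an early return, and O(n) total cost) is visible. The following is therefore only a hedged sketch of how a prefix scan with that shape could be written, not the authors' algorithm. The stopping rule (overlap divided by prefix length falling below a cutoff) and the `threshold` parameter are assumptions introduced for illustration, as is the function name `lacp_length`.

```python
from collections import Counter


def lacp_length(s: str, t: str, threshold: float = 0.8) -> int:
    """Length of the longest prefix pair whose histograms still overlap enough.

    ASSUMPTION: the cutoff rule and the default threshold are illustrative;
    the source table does not show the actual conditions.
    """
    hist_s: Counter = Counter()
    hist_t: Counter = Counter()
    best = 0
    for i in range(min(len(s), len(t))):      # one pass over positions: O(n)
        hist_s[s[i]] += 1                     # extend prefix histogram of s
        hist_t[t[i]] += 1                     # extend prefix histogram of t
        # Intersection size: bounded by the alphabet size, so constant
        # work per position for a fixed alphabet.
        overlap = sum((hist_s & hist_t).values())
        if overlap / (i + 1) < threshold:     # prefixes diverged too far
            return best
        best = i + 1
    return best


if __name__ == "__main__":
    print(lacp_length("ammonium", "ammonium ion"))
    print(lacp_length("abcd", "abdc", threshold=0.6))
```

Because the histograms are updated incrementally rather than rebuilt for each prefix, the scan touches each position once, which is consistent with the O(n) total listed in the table.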
Average precision P
| Jaccard | 0.31 | 0.33 | 0.22 | 0.54 |
| Jaro | 0.26 | 0.40 | 0.14 | 0.69 |
| Jaro-Winkler | 0.44 | 0.45 | 0.14 | 0.69 |
| Levenshtein | 0.16 | 0.21 | 0.18 | 0.54 |
| Monge-Elkan | 0.22 | 0.32 | 0.12 | 0.65 |
| Needleman-Wunsch | 0.16 | 0.21 | 0.18 | 0.54 |
| Smith-Waterman | 0.18 | 0.16 | 0.09 | 0.34 |
| TFIDF | 0.51 | 0.69 | ||
| Soft TFIDF | 0.51 | 0.69 | ||
| LACP | 0.51 | 0.12 |
Note: The best values for each column are formatted in bold italics.
Maximum
| Jaccard | 0.33 | 0.38 | 0.37 | 0.59 |
| Jaro | 0.33 | 0.49 | 0.28 | 0.77 |
| Jaro-Winkler | 0.56 | 0.57 | 0.28 | 0.77 |
| Levenshtein | 0.21 | 0.28 | 0.33 | 0.65 |
| Monge-Elkan | 0.24 | 0.37 | 0.26 | 0.67 |
| Needleman-Wunsch | 0.21 | 0.28 | 0.33 | 0.65 |
| Smith-Waterman | 0.21 | 0.22 | 0.18 | 0.38 |
| TFIDF | 0.49 | 0.58 | 0.70 | |
| Soft TFIDF | 0.49 | 0.58 | 0.71 | |
| LACP | 0.27 |
Note: The best values for each column are formatted in bold italics.
Execution time in seconds
| Jaccard | 70 | 20 | 568 | 324 |
| Jaro | 105 | 25 | 3,637 | 1,102 |
| Jaro-Winkler | 115 | 26 | 3,617 | 1,265 |
| Levenshtein | 1,273 | 301 | 57,811 | 16,596 |
| Monge-Elkan | 6,240 | 1,340 | 258,502 | 77,555 |
| Needleman-Wunsch | 1,294 | 258 | 57,982 | 15,918 |
| Smith-Waterman | 1,444 | 293 | 58,753 | 17,519 |
| TFIDF | 132 | 37 | 928 | 558 |
| Soft TFIDF | 208 | 144 | 186,937 | 11,983 |
| LACP |
Note: The best values for each column are formatted in bold italics.
Figure 2. Precision-recall curves of the evaluated methods. Figure 2 depicts four precision-recall charts plotting interpolated precision values at 11 recall levels. The horizontal axis shows 11 recall points; the vertical axis displays interpolated precision values. A method with a larger area under its curve demonstrates a better result. The differences in performance between LACP, TFIDF and Soft TFIDF are easily apparent. For D1 and D4, LACP consistently outperforms the other two methods. It is important to note, however, that on D2, LACP experiences a rapid precision drop after recall = 0.5, and that on D3, LACP is inferior to most methods.
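The caption does not spell out how the interpolated precision values were obtained. As a sketch, assuming the standard information-retrieval definition (the interpolated precision at recall level r is the highest precision observed at any recall of at least r), the 11 points could be computed as follows. The function name and the input format (a ranked list of 0/1 relevance flags) are hypothetical.

```python
from typing import List, Sequence


def eleven_point_interpolated_precision(
    relevance: Sequence[int], total_relevant: int
) -> List[float]:
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    relevance: ranked results, 1 if the item at that rank is relevant, else 0.
    total_relevant: number of relevant items that exist for the query.
    """
    precisions, recalls = [], []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        precisions.append(hits / rank)
        recalls.append(hits / total_relevant)

    levels = [i / 10 for i in range(11)]
    # For each level, take the best precision at any recall >= that level;
    # levels that are never reached get 0.0.
    return [
        max((p for p, r in zip(precisions, recalls) if r >= level), default=0.0)
        for level in levels
    ]


if __name__ == "__main__":
    print(eleven_point_interpolated_precision([1, 0, 1, 0], total_relevant=2))
```

Averaging these 11-point vectors over all queries gives one curve per method, and curves of this kind are what a chart such as Figure 2 plots.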
Example of similar terms with different concept IDs from dataset
| C0602912 | Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 4-chloro-N(1)-methyl-N(1)-((tetrahydro-2-methyl-2-furanyl)methyl)-1,3-benzenedisulfonamide and 3-hydroxy-alpha-methyl-L-tyrosine |
| C0053099 | Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 4-chloro-N(1)-methyl-N(1)-((tetrahydro-2-methyl-2-furanyl)methyl)-1,3-benzenedisulfonamide and myo-inositol hexa-3-pyridinecarboxylate |
| C0050737 | Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 6-chloro-3,4-dihydro-2H-1,2,4-benzothiadiazine-7-sulfonamide 1,1-dioxide and 1(2H)-phthalazinone hydrazine |
| C0600796 | Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 6-chloro-3,4-dihydro-2H-1,2,4-benzothiadiazine-7-sulfonamide 1,1-dioxide and 5-ethyl-5-(1-methylpropyl)-2,4,6(1H,3H,5H)-pyrimidinetrione monosodium salt |
| C0602088 | Yohimban-16-carboxylic acid, 11,17-dimethoxy-18-((3,4,5-trimethoxybenzoyl)oxy)-, methyl ester, (3beta,16beta,17alpha,18beta,20alpha)-, mixt. with 6-chloro-3,4-dihydro-2H-1,2,4-benzothiadiazine-7-sulfonamide 1,1-dioxide, 1(2H)-phthalazinone hydrazone and potassium chloride (KCl) |