Lina F Soualmia, Elise Prieur-Gaston, Zied Moalla, Thierry Lecroq, Stéfan J Darmoni.
Abstract
BACKGROUND: The Internet is a major source of health information but most seekers are not familiar with medical vocabularies. Hence, their searches fail due to bad query formulation. Several methods have been proposed to improve information retrieval: query expansion, syntactic and semantic techniques or knowledge-based methods. However, it would be useful to clean those queries which are misspelled. In this paper, we propose a simple yet efficient method in order to correct misspellings of queries submitted by health information seekers to a medical online search tool.
Year: 2012 PMID: 23095521 PMCID: PMC3439674 DOI: 10.1186/1471-2105-13-S14-S11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Soundex codes
| Digit | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Letters | b, f, p, v | c, g, j, k, q, s, x, z | d, t | l | m, n | r |
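As an illustration of how the letter groups above are used, here is a minimal sketch of the classic Soundex encoding in Python. It assumes the standard scheme (first letter kept, following consonants replaced by digits, adjacent duplicates collapsed, result padded to four characters); the variant actually used for French queries in the paper may differ, and the function name `soundex` is only illustrative.

```python
# Letter-to-digit groups copied from the Soundex table above.
SOUNDEX_GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
SOUNDEX_CODE = {
    letter: digit
    for digit, letters in SOUNDEX_GROUPS.items()
    for letter in letters
}


def soundex(word: str, length: int = 4) -> str:
    """Classic Soundex: first letter followed by digits, zero-padded."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODE.get(first, "")
    for ch in letters[1:]:
        code = SOUNDEX_CODE.get(ch, "")
        # Keep a digit only if it differs from the previous coded letter.
        if code and code != previous:
            digits.append(code)
        # 'h' and 'w' do not separate letters sharing a code; vowels do.
        if ch not in "hw":
            previous = code
    return (first.upper() + "".join(digits) + "0" * length)[:length]
```

Two words that sound alike, such as "Robert" and "Rupert", receive the same code under this scheme, which is what makes it usable for matching misspelled queries.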
Phonemisation codes
| Code | Sound | Example |
|---|---|---|
| 1 | "u"[œ] | Comm |
| 2 | "oi" [wa] | F |
| 3 | "ou" [u] | Gen |
| 4 | "en" [ã] | Sci |
| 5 | "ch" [ʃ] | Bron |
| 6 | "ill" [j] | Ore |
| 7 | "gn" [ɲ] | Soi |
| 8 | "é" [e] "è" [ε] "e" [ø] | Pr |
| 0 | "oin" [wœ] | S |
String modifications according to letters combinations and groups of letters before and after the combination
| Combination | Group of letters before | Group of letters after | Modification |
|---|---|---|---|
| An | | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 4 |
| Am | | 'a','e','i','o','u','n','m','1','2','3','4','6','7','8','0' | 4 |
| Ein | | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 1 |
| Ain | | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 1 |
| Eim | | 'a','e','i','o','u','m','1','2','3','4','6','8','0' | 1 |
| En | | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 4 |
| Em | | 'a','e','i','o','u','m','1','2','3','4','6','8','0' | 4 |
| Oin | | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 0 |
| In | 'o', 'e', 'a' | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 1 |
| Im | 'o', 'e', 'a' | 'a','e','i','o','u','m','1','2','3','4','6','8','0' | 1 |
| Un | | 'a','e','i','o','u','n','1','2','3','4','6','8','0' | 1 |
| Ge | | 'a','o','2','3','4','0' | g |
| Gu | | 'e','i','1','2','4','6','8','0' | g |
Some modifications according to letters combinations
| Combination | Modification | Combination | Modification | Combination | Modification | Combination | Modification | Combination | Modification |
|---|---|---|---|---|---|---|---|---|---|
| sch | 5 | l1 | l8n | irop | iro | qu | k | 5t | kt |
| Ch | 5 | U | o | irops | iro | s | ss | 5l | kl |
| Sh | 5 | r0 | ro1 | thm | m | h | Ø | ptio | psio |
| Ai | 8 | omac | oma | stme | sm | 31 | 0 | ati4 | assi4 |
| Xs | ks | 8 mm | am | Am7 | ami | ei | 8 | Oz1 | os1 |
| o6 | 26 | si5 | sik | tion | sion | oi | 2 | q | k |
| oeu | 8 | gn | 7 | 5o | ko | c | k | 5r | kr |
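The table above can be read as an ordered list of string substitutions. The sketch below applies a small subset of those pairs in Python. The choice of rules, the order of application (longer combinations first) and the names `RULES` and `apply_rules` are assumptions made for illustration; the full phonemisation also depends on the context-sensitive rules of the previous table, which are not reproduced here.

```python
# Illustrative subset of (combination, modification) pairs from the table above.
# Assumption: rules are applied in list order, longer combinations first.
RULES = [
    ("tion", "sion"),
    ("sch", "5"),
    ("ch", "5"),
    ("sh", "5"),
    ("qu", "k"),
    ("gn", "7"),
    ("ai", "8"),
    ("ei", "8"),
    ("oi", "2"),
    ("q", "k"),
    ("c", "k"),
]


def apply_rules(word: str, rules=RULES) -> str:
    """Lower-case the word, then apply each substitution in turn."""
    text = word.lower()
    for combination, modification in rules:
        text = text.replace(combination, modification)
    return text
```

Because each rule rewrites the output of the previous one, the order matters: "sch" has to be handled before "ch", and "ch" before the single letter "c".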
Some sound matching
| Word | Phonemisation |
|---|---|
| Acupuncture | Akup1ktur |
| Tabac | Taba |
| Ville | Vil |
| Sang | S4 |
Composition of the reference dictionary based on the MeSH in French
| MeSH Terms | MeSH Synonyms | CISMeF synonyms | Total | |
|---|---|---|---|---|
| 9,679 | 9,391 | 3,359 | 22,429 | |
| 9,833 | 28,051 | 8,258 | 46,142 | |
| 4,204 | 19,551 | 6,569 | 30,324 | |
| 2,503 | 16,992 | 4,924 | 24,419 | |
Structure of the queries (with no answer) obtained from the logs
| Composition | Number |
|---|---|
| 1 word | 1,061 |
| 2 words | 1,636 |
| 3 words | 1,443 |
| 4 (and more) words | 2,157 |
Numbers of proposed corrections with the Levenshtein edit distance at different thresholds
| Thresholds | < 0.05 | < 0.1 | < 0.15 | < 0.2 | < 0.25 | < 0.3 | < 0.35 | < 0.4 | < 0.45 | < 0.5 | < 0.6 | < 0.7 | < 0.8 | < 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed corrections | 14 | 73 | 118 | 176 | 273 | 549 | 1,187 | 2,265 | 4,707 | 8,448 | 59,844 | 656,291 | 5,368,088 | 13,695,608 |
| Per query | 0.08 | 0.44 | 0.72 | 1.07 | 1.67 | 3.36 | 7.28 | 13.89 | 28.87 | 51.83 | 367.14 | 4,026.32 | 32,933 | 84,022 |
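Thresholds between 0 and 1 imply a normalized edit distance. The following Python sketch computes the Levenshtein distance with dynamic programming and divides it by the length of the longer string; that normalization, as well as the function names and the `suggest` helper, are assumptions made for illustration rather than the authors' exact definition.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def suggest(query: str, dictionary, threshold: float = 0.2):
    """Return dictionary terms whose normalized distance is below threshold."""
    return [t for t in dictionary if normalized_levenshtein(query, t) < threshold]
```

With a low threshold only near-identical terms are proposed, which is consistent with the small number of suggestions in the leftmost columns; raising the threshold admits more and more unrelated terms.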
Numbers of proposed corrections with the Stoilos function at different thresholds
| Thresholds | > 0.1 | > 0.2 | > 0.3 | > 0.4 | > 0.5 | > 0.6 | > 0.7 | > 0.8 | > 0.9 |
|---|---|---|---|---|---|---|---|---|---|
| Proposed corrections | 42,721 | 23,658 | 12,748 | 6,884 | 3,490 | 1,636 | 703 | 305 | 119 |
| Per query | 262.09 | 145.14 | 78.2 | 42.23 | 21.41 | 10.03 | 4.31 | 1.87 | 0.73 |
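For readers unfamiliar with the Stoilos function, here is a hedged Python sketch of the string similarity as it is commonly described: a commonality term built from repeatedly removed longest common substrings, minus a difference term computed from the unmatched remainders with a parameter p = 0.6. The optional Winkler prefix bonus of the original metric is left out, and the handling of empty strings is a convention chosen here; the implementation evaluated in the paper may differ on these points.

```python
def longest_common_substring(a: str, b: str):
    """Return (start_a, start_b, length) of the longest common substring."""
    best = (0, 0, 0)
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best[2]:
                    best = (i - current[j], j - current[j], current[j])
        previous = current
    return best


def stoilos(s1: str, s2: str, p: float = 0.6) -> float:
    """Commonality minus difference, in [-1, 1] (Winkler bonus omitted)."""
    if not s1 or not s2:
        # Convention chosen for this sketch, not taken from the paper.
        return 1.0 if s1 == s2 else -1.0
    rest1, rest2 = s1, s2
    common = 0
    while rest1 and rest2:
        i, j, n = longest_common_substring(rest1, rest2)
        if n == 0:
            break
        common += n
        # Remove the shared substring from both strings and search again.
        rest1 = rest1[:i] + rest1[i + n:]
        rest2 = rest2[:j] + rest2[j + n:]
    commonality = 2 * common / (len(s1) + len(s2))
    u1 = len(rest1) / len(s1)  # unmatched share of s1
    u2 = len(rest2) / len(s2)  # unmatched share of s2
    product = u1 * u2
    difference = product / (p + (1 - p) * (u1 + u2 - product))
    return commonality - difference
```

Unlike the edit distance, a higher value means a closer match, which is why the thresholds in the table are lower bounds.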
Numbers of proposed corrections (between brackets the number by query) at different thresholds with the Stoilos function combined with the Levenshtein edit distance
| Stoilos (rows) / Levenshtein (columns) | < 0.05 | < 0.1 | < 0.15 | < 0.2 | < 0.3 | < 0.4 | < 0.5 | < 0.6 | < 0.7 | < 0.8 | < 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| > 0.1 | 6 (0.03) | 63 (0.38) | 107 (0.65) | 165 (1.01) | 538 (3.30) | 2,188 (13.42) | 6,563 (40.20) | 18,274 (112.11) | 30,303 (185.90) | 39,456 (242.06) | 42,483 (260.63) |
| > 0.2 | 6 (0.03) | 63 (0.38) | 107 (0.65) | 165 (1.01) | 537 (3.29) | 2,118 (12.99) | 5,806 (35.61) | 13,053 (80.79) | 18,790 (115.27) | 22,395 (137.39) | 23,576 (144.63) |
| > 0.3 | 6 (0.03) | 63 (0.38) | 107 (0.65) | 165 (1.01) | 534 (3.27) | 1,990 (12.20) | 4,680 (28.71) | 8,352 (51.23) | 10,909 (66.92) | 12,328 (75.63) | 12,709 (77.96) |
| > 0.4 | 6 (0.03) | 63 (0.38) | 107 (0.65) | 165 (1.01) | 526 (3.22) | 1,789 (10.97) | 3,548 (21.76) | 5,262 (32.28) | 6,236 (38.25) | 6,749 (41.40) | 6,864 (42.11) |
| > 0.5 | 6 (0.03) | 63 (0.38) | 107 (0.65) | 164 (1.00) | 492 (4.92) | 1,397 (8.57) | 2,313 (14.19) | 2,910 (17.85) | 3,268 (20.04) | 3,435 (21.07) | 3,478 (21.33) |
| > 0.6 | 6 (0.03) | 63 (0.38) | 107 (0.65) | 162 (0.99) | 431 (2.64) | 864 (5.30) | 1,199 (7.35) | 1,431 (8.77) | 1,562 (9.58) | 1,617 (9.92) | 1,625 (9.96) |
| > 0.7 | 6 (0.03) | 63 (0.38) | 106 (0.65) | 160 (0.98) | 292 (1.79) | 448 (2.74) | 556 (3.41) | 653 (4.0) | 685 (4.20) | 690 (4.23) | 692 (4.24) |
| > 0.8 | 6 (0.03) | 62 (0.38) | 97 (0.59) | 138 (0.84) | 182 (1.11) | 231 (1.41) | 275 (1.68) | 288 (1.76) | 290 (1.77) | 293 (1.79) | 294 (1.80) |
| > 0.9 | 6 (0.03) | 52 (0.31) | 79 (0.48) | 95 (0.58) | 103 (0.63) | 105 (0.64) | 106 (0.65) | 106 (0.65) | 106 (0.65) | 108 (0.66) | 108 (0.66) |
Figure 1. Total number of suggestions according to different thresholds of Levenshtein and Stoilos.
Evaluations and numbers of corrected queries for Levenshtein edit distance with different thresholds
| Threshold | < 0.05 | < 0.1 | < 0.15 | < 0.2 | < 0.25 | < 0.3 | < 0.35 | < 0.4 | < 0.45 | < 0.5 | < 0.6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Proposed corrections | 14 | 73 | 118 | 176 | 273 | 549 | 1,187 | 2,265 | 4,707 | 8,448 | 59,844 |
| Corrected queries (Q) | 14 | 71 | 105 | 126 | 137 | 141 | 148 | 154 | 157 | 162 | 163 |
| Precision (%) | 100 | 100 | 99.04 | 97.61 | 95.62 | 95.03 | 91.89 | 91.55 | 89.80 | 87.03 | 86.50 |
| Recall (%) | 8.58 | 43.55 | 63.80 | 75.46 | 80.36 | 82.20 | 83.43 | 86.50 | 86.50 | 86.50 | 86.50 |
| F-measure (%) | 15.81 | 60.68 | 77.61 | 85.12 | 87.33 | 88.15 | 87.45 | 88.95 | 88.12 | 86.76 | 86.50 |
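The F-measure values in these evaluation tables are consistent with the usual harmonic mean of precision and recall. A minimal Python sketch follows, assuming the balanced (F1) form; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 100 and recall 43.55, as reported for Levenshtein < 0.1.
print(round(f_measure(100, 43.55), 2))  # 60.68
```

The harmonic mean penalizes imbalance, so a threshold with perfect precision but low recall still scores poorly.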
Evaluations and numbers of corrected queries for Stoilos function with different thresholds
| Threshold | > 0.9 | > 0.8 | > 0.7 | > 0.6 | > 0.5 | > 0.4 | > 0.3 | > 0.2 | > 0.1 |
|---|---|---|---|---|---|---|---|---|---|
| Proposed corrections | 119 | 305 | 705 | 1,636 | 3,490 | 6,884 | 12,748 | 23,659 | 42,721 |
| Corrected queries (Q) | 90 | 128 | 143 | 148 | 157 | 162 | 163 | 163 | 163 |
| Precision (%) | 97.77 | 84.37 | 90.20 | 89.86 | 86.62 | 86.41 | 85.88 | 85.88 | 85.88 |
| Recall (%) | 53.98 | 66.25 | 79.14 | 81.59 | 83.43 | 85.88 | 85.88 | 85.88 | 85.88 |
| F-measure (%) | 69.56 | 74.22 | 84.31 | 85.55 | 85.00 | 86.15 | 85.88 | 85.88 | 85.88 |
Figure 2. Precision (P) and recall (R) curves according to different thresholds of Levenshtein (Lev) and Stoilos (Sto).
Evaluation (P: Precision, R: Recall, F: F-Measure) and number of corrected queries (Q) with Levenshtein and Stoilos combinations
| Levenshtein | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| < 0.05 | < 0.1 | < 0.15 | < 0.2 | < 0.3 | < 0.4 | < 0.5 | < 0.6 | < 0.7 | < 0.8 | < 0.9 | ||
| Q:50 | Q:74 | Q:83 | Q:84 | |||||||||
| Q:89 | Q:109 | Q:110 | Q:114 | Q:115 | ||||||||
| Q: 6 | Q:119 | Q:123 | Q:130 | |||||||||
| Q:59 | Q:97 | Q:121 | Q:127 | Q:130 | ||||||||
| Q:129 | Not evaluated | |||||||||||
| Q:122 | Q:130 | |||||||||||
| P = 94.26 | P = 83.84 | |||||||||||
| R = 70.55 | R = 66.87 | |||||||||||
| F = 80.70 | F = 74.75 | |||||||||||
Figure 3. Precision curves according to different thresholds of Levenshtein combined with Stoilos (Sto) with different thresholds.
Figure 4. Recall curves: Levenshtein combined with Stoilos.
Figure 5. Times according to the size of the queries with Lev < 0.2 and Sto > 0.7.
Figure 6. Total number of suggestions according to the size of the query.
Number of suggestions according to the size of the query
| Query size | Number of characters | Number of suggestions per query |
|---|---|---|
| 1 word | Min = 3; Avg = 10.49; Max = 25 | Avg = 0.39; Max = 5 |
| 2 words | Min = 5; Avg = 18.36; Max = 41 | Avg = 0.22; Max = 6 |
| 3 words | Min = 10; Avg = 24.39; Max = 54 | Avg = 0.13; Max = 1 |
| 4 (and more) words | Min = 11; Avg = 37.30; Max = 113 | Avg = 0.06; Max = 1 |
Evaluation measures of the different methods : Bag-of-Words (BoW), Levenshtein along with Stoilos (LS), LS performed before BoW, and BoW performed before Levenshtein combined with Stoilos
| Method | 1 word P(%) | 1 word R(%) | 1 word F(%) | 2 words P(%) | 2 words R(%) | 2 words F(%) | 3 words P(%) | 3 words R(%) | 3 words F(%) | 4 words + P(%) | 4 words + R(%) | 4 words + F(%) | Total P(%) | Total R(%) | Total F(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BoW | 100 | 26.85 | 42.33 | 100 | 34.81 | 51.64 | 100 | 44.06 | 61.17 | 100 | 38.16 | 55.24 | 100 | 35.88 | 52.81 |
| | [100-100] | [19.73-33.96] | [32.96-50.70] | [100-100] | [27.38-42.24] | [42.99-59.39] | [100-100] | [35.92-52.19] | [52.85-68.59] | [100-100] | [30.44-45.88] | [46.67-62.90] | [100-100] | [32.05-39.71] | [48.54-56.85] |
| LS | 92.11 | 46.98 | 62.22 | 82.61 | 36.08 | 50.22 | 51.56 | 23.08 | 31.88 | 46.77 | 11.18 | 18.05 | 69.74 | 29.40 | 41.37 |
| | [86.04-98.17] | [38.97-54.99] | [53.64-70.49] | [73.67-91.55] | [28.59-43.56] | [40.76-59.03] | [39.32-63.81] | [16.17-29.98] | [22.92-40.79] | [34.35-59.19] | [6.17-16.19] | [10.46-25.43] | [64.27-75.21] | [25.76-33.04] | [36.78-45.91] |
| LS before BoW | 93.10 | 54.36 | 68.64 | 83.78 | 39.24 | 53.45 | 58.67 | 27.97 | 37.88 | 51.47 | 12.50 | 20.1 | 73.03 | 30.40 | 42.93 |
| | [87.78-98.43] | [46.36-62.36] | [60.68-76.35] | [75.39-92.18] | [31.63-46.85] | [44.56-62.13] | [47.52-69.81] | [20.62-35.33] | [28.76-46.92] | [39.59-63.35] | [7.24-17.76] | [12.24-27.74] | [68.04-78.02] | [26.72-34.07] | [38.37-47.43] |
| BoW before LS | 86.67 | | | 84.96 | | | 65.65 | | | 72.92 | | | 77.08 | | |
| | [80.16-93.17] | [53.24-68.9] | [63.98-79.22] | [78.36-91.55] | [53.15-68.37] | [63.34-78.28] | [57.52-73.78] | [52.11-68.16] | [54.68-70.86] | [64.03-81.81] | [38.13-53.98] | [47.80-65.04] | [73.17-80.98] | [51.01-58.96] | [60.11-68.24] |
Figure 7. Proportion of matched queries according to the method and the size of the query: Bag-of-Words (BoW), Levenshtein alongside Stoilos (LS) and BoW with LS.
Figure 8. Precision curves according to the size of the query.
Figure 9. Recall curves according to the size of the query.
Figure 10. F-Measure curves according to the size of the query.