P Ruch, R Baud, A Geissbühler.
Abstract
Unlike journal corpora, which are supposed to be carefully reviewed before publication, documents in a patient record are often corrupted by misspelled words and conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag-of-words approach, with stems as representation items and term frequency-inverse document frequency (tf-idf) as the weighting scheme; we pay special attention to the normalization factor. First results show that even low corruption levels (3%) do affect retrieval effectiveness (4-7%), whereas higher corruption levels can affect retrieval effectiveness by 25%. Then, we show that the use of an improved automatic spelling correction system, applied to the corrupted collection, can almost restore the retrieval effectiveness of the engine.
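To make the mechanism concrete, the following is a minimal sketch of a bag-of-words engine with tf-idf weights and a cosine normalization factor, showing why a single misspelling can hide a document from a query. It is an illustration under stated assumptions, not the system evaluated in the paper: whitespace tokens stand in for stems, the idf formula and cosine normalization are common textbook choices, and the three-document collection and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for stems.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vectors, idf) with cosine-normalized tf-idf weights per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {term: tf[term] * idf[term] for term in tf}
        # Normalization factor: Euclidean length of the raw weight vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm > 0 else {})
    return vectors, idf


def score(query, vector, idf):
    """Dot product of idf-weighted query terms with a document vector."""
    q = Counter(tokenize(query))
    return sum(q[t] * idf.get(t, 0.0) * vector.get(t, 0.0) for t in q)


# Hypothetical toy collection; the second version misspells one word.
clean = [
    "patient has acute hepatitis",
    "patient has chronic asthma",
    "no fever reported",
]
corrupted = ["patient has acute hepatitys"] + clean[1:]

clean_vectors, clean_idf = tfidf_vectors(clean)
corrupted_vectors, corrupted_idf = tfidf_vectors(corrupted)

clean_score = score("hepatitis", clean_vectors[0], clean_idf)
corrupted_score = score("hepatitis", corrupted_vectors[0], corrupted_idf)
```

In this sketch the query term matches the first document in the clean collection but not in the corrupted one, because the misspelled token is indexed as a different term. Running a spelling corrector over the collection before indexing would map the token back to its intended form, which is the kind of recovery the abstract reports.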
Year: 2002 PMID: 12460633 DOI: 10.1016/s1386-5056(02)00057-6
Source DB: PubMed Journal: Int J Med Inform ISSN: 1386-5056 Impact factor: 4.046