Shamim Ara Mollah, Stephen B Johnson.
Abstract
Conversion of free-text strings in a natural language to a standard representation (codes) is an important recurring problem in biomedical informatics. Determining the content of a string involves identifying its meaningful constituents (morphemes). One current method of identifying these constituents is to look them up in a preexisting table (lexicon). Manual construction of lexicons and grammars in complex domains such as biomedicine is extremely laborious. As an alternative to the lexico-grammatical approach, we introduce a segmentation algorithm that automatically learns lexical and structural preferences from corpora via information compression. The method is based on the Minimum Description Length (MDL) principle from classic information theory.
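The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general two-part MDL idea it names, not the authors' method. It scores a candidate segmentation by the bits needed to write down the lexicon plus the bits needed to encode the corpus with that lexicon. The function name `description_length`, the flat per-character cost `BITS_PER_CHAR`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumption: each lexicon character costs a flat 5 bits (roughly a
# 32-symbol alphabet). The source does not specify any encoding.
BITS_PER_CHAR = 5.0


def description_length(segmented_corpus):
    """Two-part MDL cost (in bits) of a segmented corpus.

    segmented_corpus: list of words, each word a list of segment strings.
    Hypothetical helper for illustration only.
    """
    counts = Counter(seg for word in segmented_corpus for seg in word)
    total = sum(counts.values())
    # Part 1: spell out every distinct segment once, plus one end marker each.
    lexicon_bits = sum((len(seg) + 1) * BITS_PER_CHAR for seg in counts)
    # Part 2: encode each token with an ideal code of -log2 p(segment) bits.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits


# Toy corpus: the same four terms, unsegmented and split into shared pieces.
whole = [["gastritis"], ["hepatitis"], ["gastrectomy"], ["hepatectomy"]]
split = [["gastr", "itis"], ["hepat", "itis"],
         ["gastr", "ectomy"], ["hepat", "ectomy"]]

print(description_length(whole))  # 228.0
print(description_length(split))  # 136.0
```

Under these assumed costs the split version is cheaper because the shared pieces are stored once in the lexicon and reused. An MDL-based learner would search over candidate segmentations and prefer the one with the lowest total.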
Year: 2003 PMID: 14728443 PMCID: PMC1480252
Source DB: PubMed Journal: AMIA Annu Symp Proc ISSN: 1559-4076