Thomas Ostermann, Christa K Raak, Peter F Matthiessen, Arndt Büssing, Hartmut Zillmann.
Abstract
Complementary and alternative therapies and medicines (CAM), such as acupuncture or mistletoe treatment, are in high demand among cancer patients. With growing interest in such therapies, physicians need a simple tool for getting an overview of the scientific publications on CAM, particularly those not listed in common bibliographic databases such as MEDLINE. CAMbase is an XML-based bibliographical database on CAM that addresses this need. A custom front-end search engine performs semantic analysis of textual input, enabling users to quickly find information relevant to their search queries. This article describes the technical background and architecture of CAMbase, a free online database on CAM (www.cambase.de). We give examples of its use, describe the underlying algorithms, and present recent statistics for search terms related to complementary therapies in oncology.
Keywords: cancer; complementary medicine; database; literature; semantic web
Year: 2009 PMID: 19718447 PMCID: PMC2730176 DOI: 10.4137/cin.s1182
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Relative growth of bibliographical datasets in various online sources of information for the search query term “cancer” from 1996 to 2006. Absolute number of hits for “cancer” in 1996: PubMed: 56070; CAM subset of PubMed: 2611; EMBASE: 290; CINAHL: 451; AMED: 208.
Figure 2. Linguistic processing of natural language queries: a subject search for “European mistletoe for the treatment of cancer” leads to different results than “mistletoe treatment of cancer in Europe.”
Description and examples of NLP techniques applied in CAMbase.
| NLP-Technique | Description | Examples |
|---|---|---|
| Inflections | Change of word form that does not change the part-of-speech category, such as conjugation | Treat/-ing/-ed; assess/-ment |
| Compound words | Rule-based decomposition of a word into its base forms; base form compounding | “Krebstherapie” (engl.: cancer therapy) is decomposed into “Krebs” and “Therapie”; “Lektine” (engl.: lectins) also finds “Mistellektine” (engl.: mistletoe lectins) |
| Irregular plurals | Rule-based analysis of German plural forms | “Fragebogen” (engl.: questionnaire) finds “Fragebögen” and vice versa |
| Stemming | Rule-based stripping of affixes to obtain the stem (root) of the word | German: Chemotherap/-ie/-n; English: filter/-s |
| Subject-object relations | Grammatical dissolution and analysis of the search phrase | See |
| Punctuation | Splitting the character sequence at white space positions | |
| Normalisation | Analysis of capitalised words | MIT (Massachusetts Institute of Technology) is different from the German “mit” (engl.: with) |
| Analysing word frequencies and collocations | Comparing different uses and occurrences of the same word/stem | Thematic landscapes; tag clouds (see |
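
Three of the techniques in the table (white-space splitting, stemming, and compound decomposition) can be illustrated with a minimal sketch. The suffix lists, the lexicon, and the function names `tokenize`, `stem`, and `decompose` are assumptions made for this example; the actual CAMbase rules are not given in the source.

```python
# Hypothetical suffix rules, chosen only to reproduce the table's examples.
SUFFIXES = {"en": ["ing", "ed", "s"], "de": ["ien", "ie", "en", "n"]}
PUNCT = ".,;:!?()\"'"


def tokenize(text):
    """Split at white space and strip surrounding punctuation."""
    tokens = (t.strip(PUNCT) for t in text.split())
    return [t for t in tokens if t]


def stem(word, lang):
    """Strip the longest matching suffix, keeping at least 3 characters.

    Lowercasing is a simplification: it discards the case distinction
    (MIT vs. "mit") that the normalisation step is meant to preserve.
    """
    w = word.lower()
    for suffix in sorted(SUFFIXES[lang], key=len, reverse=True):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def decompose(word, lexicon):
    """Split a compound into two base forms if both are in the lexicon."""
    w = word.lower()
    for i in range(3, len(w) - 2):
        head, tail = w[:i], w[i:]
        if head in lexicon and tail in lexicon:
            return [head, tail]
    return [w]


lexicon = {"krebs", "therapie", "mistel", "lektine"}  # toy lexicon
print(tokenize("European mistletoe, for cancer."))
print(stem("treating", "en"), stem("Chemotherapien", "de"))
print(decompose("Krebstherapie", lexicon))
```

A production system would need a far larger rule set and lexicon, plus handling for umlauts and linking elements in German compounds, none of which is attempted here.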
Figure 3. Behaviour of the search algorithms for the threshold values Q = 35 and Q = 45 and high (+) versus low (−) sensitivity. The quality index Q(s) − Q is plotted against the number of hits and yields different survival curves for the datasets retrieved with the query “adverse effects of mistletoe preparations”, depending on the parameter settings.
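The effect of the threshold Q in Figure 3 can be sketched as a simple filter on per-hit quality scores. How CAMbase computes Q(s), and what the sensitivity setting changes, is not described here, so the scores below are made-up illustration values and `surviving_hits` is a hypothetical helper.

```python
def surviving_hits(scores, q):
    """Return the margins Q(s) - Q of all hits that reach threshold q, best first."""
    return sorted((s - q for s in scores if s >= q), reverse=True)


# Made-up quality scores for six hits; not taken from CAMbase.
scores = [80, 62, 47, 44, 36, 30]

for q in (35, 45):
    margins = surviving_hits(scores, q)
    print(f"Q = {q}: {len(margins)} hits, margins {margins}")
```

Plotting each margin against its rank gives a monotonically falling curve of the kind the caption calls a survival curve; raising Q shortens it.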
Figure 4. Schematic description of an inverted file structure of a bibliographical dataset, with linguistic processing information at the front and additional discriminant information (ADIs) at the back of the coding. “*stem$mr” denotes the decomposition of a search term, e.g. “oncology” leads to “oncolog$~y”. Note that front truncation information (marked with a “*”) and umlauting are features specially designed for the German language, e.g. for compound terms like “Krebstherapie” (engl.: “therapy of cancer”).
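The general idea of an inverted file structure can be sketched as a mapping from each stem to the set of records that contain it. This is a generic illustration only: it does not reproduce the “*stem$mr” coding or the ADIs of Figure 4, and `toy_stem`, `build_index`, and the sample records are assumptions.

```python
from collections import defaultdict


def toy_stem(token):
    """Placeholder stemmer: drop a trailing plural 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


# Invented sample records, for illustration only.
docs = {
    1: "mistletoe lectins in cancer",
    2: "cancer therapy",
    3: "lectin structure",
}
index = build_index(docs)
print(sorted(index["lectin"]))  # records containing lectin or lectins
print(sorted(index["cancer"]))
```

Because lookup goes from stem to records, a query never has to scan the full text of every record.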
Figure 5. Illustration of the behaviour of the seek time t in relation to the number of search terms, for search queries processed by conventional search algorithms without inverted file structures (no IFS), conventional search algorithms operating on inverted file structures (IFS), and the special algorithms in CAMbase operating on inverted file structures (IFS+). Whereas the seek time of conventional structures without IFS increases exponentially, IFS follows a roughly logistic curve, and the search time of IFS+ decreases the more search terms are included in the query.
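One standard way in which additional search terms can reduce work on an inverted file is to intersect posting sets starting with the rarest term, so that every further term only narrows an already small candidate set. The sketch below shows that generic technique; it is not the IFS+ algorithm, whose details are not given here, and the toy index and `search_all` are assumptions.

```python
def search_all(index, terms):
    """Return ids of records containing every term, rarest term first."""
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:  # no candidates left, later terms cannot add any
            break
    return result


# Toy index from stem to record ids, invented for illustration.
index = {"mistel": {1, 2, 3}, "krebs": {2, 3, 4, 5}, "lektin": {3}}

print(search_all(index, ["krebs", "mistel"]))
print(search_all(index, ["krebs", "mistel", "lektin"]))
```

With the rare term “lektin” in the query, the candidate set starts at a single record, so the remaining intersections are cheaper than in the two-term query.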
Figure 6. Search tree and respective screenshots of CAM landscapes for the search query “Krebs” (engl.: cancer).