A R Rivas, E L Iglesias, L Borrajo.
Abstract
Information retrieval focuses on finding the documents in a large collection whose content matches a user query. Because formulating well-designed queries is difficult for most users, query expansion is needed to retrieve relevant information. Query expansion techniques are widely applied to improve the effectiveness of textual information retrieval systems. They help to overcome vocabulary mismatch by expanding the original query with additional relevant terms and reweighting the terms in the expanded query. In this paper, different text preprocessing and query expansion approaches are combined to improve on the documents initially retrieved by a query in a scientific document database. A corpus belonging to MEDLINE, called Cystic Fibrosis, is used as a knowledge source. Experimental results show that the proposed combinations of techniques greatly enhance the effectiveness obtained by traditional queries.
Year: 2014 PMID: 24723793 PMCID: PMC3958669 DOI: 10.1155/2014/132158
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: The information retrieval process.
A sample of a MEDLINE document.
| Field | Content |
|---|---|
| TI | The occurrence of Cystic Fibrosis and celiac sprue within a single sibship. |
| MJ | CYSTIC-FIBROSIS: fg. CELIAC-DISEASE: fg. |
| MN | ADULT. BIOPSY. CELIAC-DISEASE: co, fg. CHILD. CYSTIC-FIBROSIS: co. DIET-THERAPY. DILATATION. FEMALE. FLOCCULATION. GLUTEN: me. HUMAN. INTESTINAL-MUCOSA: pa, ra. INTESTINE-SMALL: pa, ra. JEJUNUM: pa. MALE. PEDIGREE. CELIAC-DISEASE: co, th. |
| AB | A family is presented in which celiac sprue and cystic fibrosis occurred within the same sibship. A cousin of the index case was also discovered to have celiac sprue. The genetics and incidence of both conditions are reviewed. It is estimated that the likelihood of this association occurring on the basis of chance in this is 1 in |
A sample query with its relevant documents and relevance scores.
| Field | Content |
|---|---|
| QU | What is the association between liver disease (cirrhosis) and vitamin A metabolism in CF? |
| RD | 165 1122 174 0001 362 0001 370 0001 414 2222 443 0100 794 2110 992 1010 1040 0001 1115 0102 |
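The RD field above alternates document numbers with four-character score strings. A minimal sketch of reading it follows, assuming that each character is one assessor's judgment (this interpretation is not stated in the excerpt) and using a hypothetical helper name, `parse_rd`.

```python
def parse_rd(rd_field: str) -> dict[int, list[int]]:
    """Map each document number to its list of per-character scores.

    Assumption: the field alternates "<doc number> <score string>" and
    every character of the score string is a separate judgment.
    """
    tokens = rd_field.split()
    if len(tokens) % 2 != 0:
        raise ValueError("RD field must hold (document, scores) pairs")
    judgments = {}
    for doc_id, scores in zip(tokens[0::2], tokens[1::2]):
        judgments[int(doc_id)] = [int(ch) for ch in scores]
    return judgments


# RD field copied from the sample query above.
rd = ("165 1122 174 0001 362 0001 370 0001 414 2222 "
      "443 0100 794 2110 992 1010 1040 0001 1115 0102")
judgments = parse_rd(rd)
relevant_docs = sorted(judgments)  # document numbers judged relevant
```

A binary relevance set such as `relevant_docs` is what measures like MAP and R-prec typically consume; how the graded scores are collapsed into that set is a design choice the excerpt does not specify.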
Correspondence between parameters of the BM25 weighting and Okapi TF.
| Okapi BM25 | Okapi TF | |
|---|---|---|
| tf | | |
| tf | | |
| tf | | |
Accuracy of query search using different stemming functions and stopword lists in Cystic Fibrosis. Evaluation measures used are MAP (mean average precision), R-prec (R precision), and D (number of relevant documents retrieved).
| Combinations | MAP | R-prec | D |
|---|---|---|---|
| Baseline | 0.1545 | 0.2098 | 683 |
| Porter stemmer | 0.1663 | 0.2154 | 747 |
| Krovetz stemmer | 0.1663 | 0.2231 | 740 |
| NLM stopword list | 0.1681 | 0.2242 | 723 |
| SMART stopword list | 0.1695 | 0.2243 | 728 |
| Porter stemmer-NLM stopwords | | | |
| Porter stemmer-SMART stopwords | | | |
| Krovetz stemmer-NLM stopwords | | | |
| Krovetz stemmer-SMART stopwords | | | |
The bold font refers to the best values for the parameters.
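The three measures in the table can be illustrated with a short sketch. It uses the textbook definitions of average precision and R-precision with binary relevance; the function names and the toy ranking are illustrative and are not taken from the paper.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranking, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def relevant_retrieved(ranking, relevant):
    """Count of relevant documents present in the ranking (the D column)."""
    return len(set(ranking) & set(relevant))


def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


# Toy data: document 8 is relevant but never retrieved.
ranking = [3, 7, 1, 9, 4]
relevant = {3, 1, 8}
ap = average_precision(ranking, relevant)    # (1/1 + 2/3) / 3
rp = r_precision(ranking, relevant)          # 2 of the top 3 are relevant
d = relevant_retrieved(ranking, relevant)    # 2
```

MAP is the mean of `average_precision` over all queries, which is why a relevant document that is never retrieved still lowers the score.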
MAP values for the TF-IDF BM25, Raw TF, and logTF formulas.
| Combinations | Parameters | | | Raw TF | logTF |
|---|---|---|---|---|---|
| k1 | 1.2 | 1.3 | 1.2 | | |
| b | 0.75 | 0.6 | 0.7 | | |
| k3 | 1000 | 1.2 | 1.2 | | |
| Porter stemmer-NLM stopwords | 0.1861 | | 0.1861 | 0.1422 | 0.1749 |
| Porter stemmer-SMART stopwords | 0.1866 | | 0.1878 | 0.1445 | 0.1742 |
| Krovetz stemmer-NLM stopwords | 0.1821 | | 0.1827 | 0.1420 | 0.1736 |
| Krovetz stemmer-SMART stopwords | 0.1828 | | 0.1839 | 0.1436 | 0.1731 |
The bold font refers to the best values for the parameters.
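To make the roles of k1, b, and k3 concrete, the sketch below computes a single term's weight with the standard Okapi BM25 formula. It is an illustration under stated assumptions: the idf variant and the function name `bm25_term_weight` are choices made here, and the toolkit used for the experiments may implement the weighting differently.

```python
import math


def bm25_term_weight(tf, qtf, doc_len, avg_doc_len, n_docs, df,
                     k1=1.2, b=0.75, k3=1000.0):
    """Standard Okapi BM25 weight of one query term in one document.

    k1 controls document term-frequency saturation, b controls document
    length normalization, and k3 controls query term-frequency saturation.
    """
    if tf <= 0 or qtf <= 0:
        return 0.0
    # Robertson/Sparck Jones idf (assumed variant).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    doc_part = tf * (k1 + 1.0) / (tf + k1 * length_norm)
    query_part = qtf * (k3 + 1.0) / (k3 + qtf)
    return idf * doc_part * query_part


# Same term statistics, two document lengths (toy numbers).
short_doc = bm25_term_weight(tf=3, qtf=1, doc_len=50, avg_doc_len=100,
                             n_docs=1000, df=10)
long_doc = bm25_term_weight(tf=3, qtf=1, doc_len=200, avg_doc_len=100,
                            n_docs=1000, df=10)
```

With b = 0 the length normalization disappears and both documents receive the same weight; with k1 = 0 the term frequency stops mattering. These are the two extremes explored when the parameters are tuned.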
MAP values for retrieval over the Abstract, Title, and MeSH fields using different weighting algorithms.
| Combinations | BM25 | TF-IDF BM25 | logTF | Raw TF |
|---|---|---|---|---|
| Porter stemmer-NLM stopwords | 0.2717 | | 0.2683 | 0.2209 |
| Porter stemmer-SMART stopwords | 0.2733 | | 0.2665 | 0.2221 |
| Krovetz stemmer-NLM stopwords | 0.2719 | | 0.2684 | 0.2208 |
| Krovetz stemmer-SMART stopwords | 0.2737 | | 0.2654 | 0.2228 |
The bold font refers to the best values for the parameters.
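The gap between the Raw TF and logTF columns comes from how strongly repeated occurrences of a term are rewarded. The sketch below assumes the common definitions, raw count versus 1 + ln(tf), each multiplied by ln(N/df); these exact formulas are an assumption, since the excerpt does not define them.

```python
import math


def tf_idf_weight(tf, df, n_docs, scheme="raw"):
    """TF-IDF weight with either a raw or a log-dampened term frequency.

    Assumed definitions: raw -> tf, log -> 1 + ln(tf), idf -> ln(N / df).
    """
    if tf <= 0 or df <= 0:
        return 0.0
    idf = math.log(n_docs / df)
    if scheme == "raw":
        tf_part = float(tf)
    elif scheme == "log":
        tf_part = 1.0 + math.log(tf)
    else:
        raise ValueError("scheme must be 'raw' or 'log'")
    return tf_part * idf


# How much more a term occurring 10 times counts than one occurring once.
raw_ratio = tf_idf_weight(10, 50, 1000, "raw") / tf_idf_weight(1, 50, 1000, "raw")
log_ratio = tf_idf_weight(10, 50, 1000, "log") / tf_idf_weight(1, 50, 1000, "log")
```

Under these definitions the raw scheme rewards the tenfold repetition tenfold, while the log scheme rewards it by a factor of roughly 3.3, which limits the influence of long or repetitive documents.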
Evaluation measures using pseudorelevance feedback in the Abstract, Title, and MeSH fields.
| Combinations | MAP | R-prec |
|---|---|---|
| Porter stemmer-NLM stopwords | 0.3468 | 0.3780 |
| Porter stemmer-SMART stopwords | 0.3391 | 0.3731 |
| Krovetz stemmer-NLM stopwords | 0.3475 | 0.3834 |
| Krovetz stemmer-SMART stopwords | 0.3435 | 0.3790 |
Evaluation measures using query expansion with descriptors, applied in MeSH, Title, and Abstract fields.
| Combinations | MAP | R-prec |
|---|---|---|
| Porter stemmer-NLM stopwords | 0.3538 | 0.3791 |
| Porter stemmer-SMART stopwords | 0.3489 | 0.3766 |
| Krovetz stemmer-NLM stopwords | 0.3465 | 0.3750 |
| Krovetz stemmer-SMART stopwords | 0.3424 | 0.3732 |
Figure 2: Recall-precision curve obtained with the query expansion methods using the MeSH, Abstract, and Title fields.
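A recall-precision curve of this kind is commonly drawn from interpolated precision at 11 standard recall levels. The sketch below shows that construction for a single query; whether the figure uses exactly this interpolation is an assumption, and the function name is illustrative.

```python
def interpolated_precision_curve(ranking, relevant, levels=11):
    """Return (recall level, interpolated precision) pairs for one query.

    Interpolated precision at a recall level is the highest precision
    observed at any recall greater than or equal to that level.
    """
    relevant = set(relevant)
    if not relevant:
        return [(i / (levels - 1), 0.0) for i in range(levels)]
    points = []  # (recall, precision) measured at each relevant hit
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        candidates = [p for r, p in points if r >= level]
        curve.append((level, max(candidates) if candidates else 0.0))
    return curve


# Toy query: two of three relevant documents are retrieved.
curve = interpolated_precision_curve([3, 7, 1, 9, 4], {3, 1, 8})
```

Averaging the precision at each level over all queries gives one curve per method, which is what allows the query expansion methods to be compared on a single plot.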
(a) The best value obtained for the k1 parameter
| Combinations | Parameters | | | | | | |
|---|---|---|---|---|---|---|---|
| k1 | 0 | 1 | 2 | 1.5 | **1.3** | 1.2 | 1.4 |
| b | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 |
| k3 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 |
| Porter stemmer-NLM stopwords | 0.1521 | 0.1803 | 0.1792 | 0.1792 | | 0.1798 | 0.1792 |
| Porter stemmer-SMART stopwords | 0.1531 | 0.1820 | 0.1783 | 0.1788 | | 0.1815 | 0.1807 |
| Krovetz stemmer-NLM stopwords | 0.1521 | 0.1797 | 0.1778 | 0.1802 | | 0.1806 | 0.1799 |
| Krovetz stemmer-SMART stopwords | 0.1557 | 0.1828 | 0.1793 | 0.1801 | | 0.1815 | 0.1819 |
The bold font refers to the best values for the parameters.
(b) The best value obtained for the b parameter
| Combinations | Parameters | | | | | | |
|---|---|---|---|---|---|---|---|
| k1 | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 |
| b | 0 | 1 | 0.75 | 0.65 | **0.6** | 0.55 | 0.70 |
| k3 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 | 1.2 |
| Porter stemmer-NLM stopwords | 0.1652 | 0.1715 | 0.1807 | 0.1810 | | 0.1821 | 0.1799 |
| Porter stemmer-SMART stopwords | 0.1669 | 0.1695 | 0.1813 | 0.1816 | | 0.1808 | 0.1817 |
| Krovetz stemmer-NLM stopwords | 0.1667 | 0.1706 | 0.1804 | 0.1802 | | 0.1819 | 0.1795 |
| Krovetz stemmer-SMART stopwords | 0.1701 | 0.1707 | 0.1824 | 0.1823 | | 0.1821 | 0.1818 |
The bold font refers to the best values for the parameters.
(c) The best value obtained for the k3 parameter
| Combinations | Parameters | | | | | | |
|---|---|---|---|---|---|---|---|
| k1 | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 | 1.3 |
| b | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 |
| k3 | 0 | 1 | 2 | 7 | 1.5 | **1.2** | 1.3 |
| Porter stemmer-NLM stopwords | 0.1824 | 0.1825 | 0.1824 | 0.1817 | 0.1823 | | 0.1822 |
| Porter stemmer-SMART stopwords | 0.1814 | 0.1813 | 0.1813 | 0.1810 | 0.1814 | | 0.1814 |
| Krovetz stemmer-NLM stopwords | 0.1811 | 0.1819 | 0.1825 | 0.1815 | 0.1822 | | 0.1819 |
| Krovetz stemmer-SMART stopwords | 0.1817 | 0.1822 | 0.1820 | 0.1815 | 0.1824 | | 0.1824 |
The bold font refers to the best values for the parameters.
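Tables (a) to (c) vary one parameter at a time while the other two stay fixed, and each table carries the previous winner forward. That procedure can be sketched as a sequential search. The `evaluate` callback and the toy objective below are hypothetical stand-ins for a full retrieval run that returns MAP, and the candidate grids are illustrative.

```python
def tune_sequentially(evaluate, start, grids):
    """Tune parameters one at a time, keeping the best value found so far.

    evaluate: callable taking keyword parameters and returning a score.
    start:    dict with the initial value of every parameter.
    grids:    dict mapping parameter name to candidate values; the dict
              order is the tuning order.
    """
    best = dict(start)
    for name, candidates in grids.items():
        scored = [(evaluate(**{**best, name: value}), value)
                  for value in candidates]
        _, best_value = max(scored, key=lambda pair: pair[0])
        best[name] = best_value
    return best


def toy_map(k1, b, k3):
    """Toy objective standing in for a retrieval run (NOT data from the paper)."""
    return -((k1 - 1.3) ** 2 + (b - 0.6) ** 2 + (k3 - 1.2) ** 2)


best = tune_sequentially(
    toy_map,
    start={"k1": 1.2, "b": 0.75, "k3": 1.2},
    grids={
        "k1": [0, 1, 2, 1.5, 1.3, 1.2, 1.4],
        "b": [0, 1, 0.75, 0.65, 0.6, 0.55, 0.70],
        "k3": [0, 1, 2, 7, 1.5, 1.2, 1.3],
    },
)
```

A sequential search evaluates far fewer settings than a full grid (21 here instead of 343), at the cost of possibly missing interactions between parameters.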
(a) The best value for the M parameter
| Combinations | Parameters | | | | Baseline |
|---|---|---|---|---|---|
| M | **10** | 5 | 15 | 30 | |
| K | 10 | 10 | 10 | 10 | |
| α | 0.5 | 0.5 | 0.5 | 0.5 | |
| Porter stemmer-NLM stopwords | | 0.1991 | 0.2070 | 0.2053 | 0.1868 |
| Porter stemmer-SMART stopwords | | 0.2026 | 0.2053 | 0.2052 | 0.1898 |
| Krovetz stemmer-NLM stopwords | | 0.1974 | 0.2007 | 0.2025 | 0.1843 |
| Krovetz stemmer-SMART stopwords | | 0.1998 | 0.2022 | 0.2056 | 0.1866 |
The bold font refers to the best values for the parameters.
(b) The best value for the K parameter
| Combinations | Parameters | | | | |
|---|---|---|---|---|---|
| M | 10 | 10 | 10 | 10 | 10 |
| K | 10 | 20 | 30 | 40 | **28** |
| α | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Porter stemmer-NLM stopwords | 0.2079 | 0.2103 | 0.2117 | 0.2102 | |
| Porter stemmer-SMART stopwords | 0.2075 | 0.2083 | 0.2105 | 0.2087 | |
| Krovetz stemmer-NLM stopwords | 0.2094 | 0.2068 | 0.2075 | 0.2068 | |
| Krovetz stemmer-SMART stopwords | 0.2074 | 0.2061 | 0.2015 | 0.2032 | |
The bold font refers to the best values for the parameters.
(c) The best value for the α parameter
| Combinations | Parameters | | | | |
|---|---|---|---|---|---|
| M | 10 | 10 | 10 | 10 | 10 |
| K | 28 | 28 | 28 | 28 | 28 |
| α | | 0.1 | 1 | 0.9 | 0.4 |
| Porter stemmer-NLM stopwords | | 0.1957 | 0.2103 | 0.2099 | 0.2105 |
| Porter stemmer-SMART stopwords | | 0.1956 | 0.2035 | 0.2091 | 0.2093 |
| Krovetz stemmer-NLM stopwords | | 0.1904 | 0.2072 | 0.2097 | 0.2089 |
| Krovetz stemmer-SMART stopwords | | 0.1932 | 0.2024 | 0.2031 | 0.2061 |
The bold font refers to the best values for the parameters.
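The excerpt does not define M, K, and α. The sketch below assumes the usual reading for pseudorelevance feedback: M is the number of top-ranked documents treated as relevant, K is the number of expansion terms taken from them, and α is the weight given to those terms. It is a simplified Rocchio-style illustration with a hypothetical function name, not the authors' implementation.

```python
from collections import Counter


def expand_query(query_weights, ranked_docs, m=10, k=28, alpha=0.5):
    """Add the k most frequent terms of the top-m documents to the query.

    query_weights: dict mapping query term to weight.
    ranked_docs:   list of token lists, best-ranked document first.
    Assumed roles: m = feedback documents, k = expansion terms,
    alpha = weight applied to the feedback term frequencies.
    """
    feedback = ranked_docs[:m]
    expanded = dict(query_weights)
    if not feedback:
        return expanded
    counts = Counter()
    for tokens in feedback:
        counts.update(tokens)
    for term, count in counts.most_common(k):
        # Average frequency over the feedback documents, scaled by alpha.
        expanded[term] = expanded.get(term, 0.0) + alpha * count / len(feedback)
    return expanded


# Toy first-pass ranking; the third document falls outside the top m = 2.
docs = [
    ["vitamin", "liver", "liver"],
    ["liver", "cirrhosis", "cirrhosis"],
    ["sweat", "test"],
]
expanded = expand_query({"vitamin": 1.0}, docs, m=2, k=2, alpha=0.5)
```

In this toy run "liver" and "cirrhosis" join the query with weights 0.75 and 0.5, the original term keeps its weight of 1.0, and terms from the document outside the top M are ignored. A larger α lets the feedback terms dominate, and a larger M or K admits more, possibly noisier, terms.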