Abstract
BACKGROUND: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: bm25 and the ranking algorithm implemented in the open-source Lucene search engine.
Year: 2009 PMID: 19192280 PMCID: PMC2695361 DOI: 10.1186/1471-2105-10-46
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Effectiveness of bm25 and the Lucene ranking algorithm on abstracts, full-text articles, and spans from full text.
| MAP | bm25 | Lucene |
| Abstract | 0.163 | 0.129 |
| Article | 0.146 (-11%)° | 0.235 (+82%)** |
| Span (max) | 0.240 (+47%)** | 0.206 (+60%)** |
| Span (sum) | 0.192 (+18%)* | 0.198 (+54%)** |
| P20 | bm25 | Lucene |
| Abstract | 0.322 | 0.293 |
| Article | 0.158 (-51%)** | 0.353 (+20%)* |
| Span (max) | 0.357 (+11%)° | 0.332 (+13%)° |
| Span (sum) | 0.314 (-3%)° | 0.317 (+8%)* |
| IP@R50 | bm25 | Lucene |
| Abstract | 0.110 | 0.090 |
| Article | 0.163 (+48%)° | 0.222 (+146%)** |
| Span (max) | 0.212 (+93%)** | 0.189 (+109%)** |
| Span (sum) | 0.149 (+36%)* | 0.159 (+77%)** |
For all metrics, relative improvements over baseline are shown; ** = statistically significant (p < 0.01); * = statistically significant (p < 0.05); ° = not significant.
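The percentages in parentheses can be reproduced approximately from the table values. The sketch below assumes that the baseline for each column is its Abstract row (the only row reported without a percentage); the function name `relative_change` is illustrative and does not come from the source.

```python
def relative_change(score: float, baseline: float) -> float:
    """Percent change of `score` relative to `baseline`."""
    return 100.0 * (score - baseline) / baseline


# Values copied from the first block of the table, first column.
# Assumption: the Abstract row is the baseline for that column.
abstract_baseline = 0.163
span_max = 0.240

print(f"{relative_change(span_max, abstract_baseline):+.0f}%")  # +47%
```

Because the table reports scores to three decimals, a percentage recomputed this way can differ by a point from the published figure, which was presumably derived from unrounded scores.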
Results of significance testing comparing article retrieval with span retrieval ("max" strategy).
| | bm25 | Lucene |
| MAP | | |
| P20 | | |
| IP@R50 | | |
Effectiveness of bm25 and the Lucene ranking algorithm combining evidence from spans with evidence from abstracts and articles.
| MAP | bm25 | Lucene |
| Span (max) | 0.240 | 0.206 |
| Span (max) + Abstract | 0.257 (+7%)° | 0.216 (+5%)° |
| Span (max) + Article | 0.257 (+7%)° | 0.262 (+27%)** |
| P20 | bm25 | Lucene |
| Span (max) | 0.357 | 0.332 |
| Span (max) + Abstract | 0.382 (+7%)° | 0.349 (+5%)° |
| Span (max) + Article | 0.343 (-4%)° | 0.404 (+22%)** |
| IP@R50 | bm25 | Lucene |
| Span (max) | 0.212 | 0.189 |
| Span (max) + Abstract | 0.215 (+1%)° | 0.190 (+1%)° |
| Span (max) + Article | 0.257 (+21%)° | 0.244 (+29%)** |
For all metrics, relative improvements over baseline are shown; ** = statistically significant (p < 0.01); * = statistically significant (p < 0.05); ° = not significant.
Comparison of different experimental conditions for bm25 and the Lucene ranking algorithm.
| Model | Metric | Comparison |
| bm25 | MAP | Span (max) + Article, Span (max) >> Abstract, Article |
| bm25 | P20 | Span (max) + Article, Span (max), Abstract >> Article |
| bm25 | IP@R50 | Span (max) + Article >> Abstract, Article; Span (max) >> Abstract |
| Lucene | MAP | Span (max) + Article >> Span (max), Article >> Abstract |
| Lucene | P20 | Span (max) + Article > Span (max), Article; Article > Abstract |
| Lucene | IP@R50 | Span (max) + Article > Span (max), Article >> Abstract |
A >> B indicates that A is significantly better than B (p < 0.01); A > B indicates that A is significantly better than B (p < 0.05).
Time required for index construction, comparing Lucene to different Ivory configurations.
| | Lucene (1 core) | Ivory (10 cores) | Ivory (20 cores) |
| Abstract | 1 h 00 m 58 s | 1 m 32 s | 1 m 07 s |
| Article | 19 h 09 m 23 s | 17 m 21 s | 9 m 57 s |
| Span | 27 h 10 m 46 s | 39 m 58 s | 24 m 56 s |
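To compare these wall-clock times numerically, the durations can be converted to seconds. The following is a small sketch; the helper `to_seconds` is a hypothetical name introduced here, and the ratio it yields is a raw elapsed-time ratio between configurations with different core counts, not a per-core efficiency figure.

```python
import re


def to_seconds(text: str) -> int:
    """Parse a duration such as '19 h 09 m 23 s' or '9 m 57 s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


# Article-index construction times copied from the table above.
lucene_article = to_seconds("19 h 09 m 23 s")
ivory20_article = to_seconds("9 m 57 s")

ratio = lucene_article / ivory20_article
print(f"Lucene (1 core) took about {ratio:.1f} times as long as Ivory (20 cores)")
```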
Time required for retrieval runs, comparing Lucene to different Ivory configurations.
| | Lucene (1 core) | Ivory (10 cores) | Ivory (20 cores) |
| Abstract (1000 hits) | 1 m 42 s | 51 s | 40 s |
| Article (1000 hits) | 7 m 00 s | 1 m 51 s | 1 m 09 s |
| Span (5000 hits) | 21 m 32 s | 11 m 57 s | 8 m 25 s |
Sample topics from the TREC 2007 genomics track.
| 200 | What serum [PROTEINS] change expression in association with high disease activity in lupus? |
| 201 | What [MUTATIONS] in the Raf gene are associated with cancer? |
| 202 | What [DRUGS] are associated with lysosomal abnormalities in the nervous system? |
| 203 | What [CELL OR TISSUE TYPES] express receptor binding sites for vasoactive intestinal peptide (VIP) on their cell surface? |
| 204 | What nervous system [CELL OR TISSUE TYPES] synthesize neurosteroids in the brain? |
| 205 | What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease? |
Figure 1. Illustration of the MapReduce framework: the "mapper" is applied to all input records, which generates results that are aggregated by the "reducer". The runtime groups together values by keys.
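The map, group-by-key, and reduce stages described in this caption can be imitated in a few lines of single-process Python. This is only a toy sketch of the programming model, not Hadoop or Ivory code; `run_mapreduce`, `count_mapper`, and `count_reducer` are names invented for the illustration, and word counting is used as a stand-in task.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply `mapper` to every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "runtime" groups values by key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def count_mapper(line):
    """Emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Aggregate all values emitted for one key."""
    return sum(counts)


result = run_mapreduce(["gene expression", "gene mutation"], count_mapper, count_reducer)
print(result)  # {'expression': 1, 'gene': 2, 'mutation': 1}
```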
Figure 2. Pseudo-code of Ivory's indexing algorithm in MapReduce. The mapper processes each document and emits postings with the associated term as the key. The reducer gathers all postings for each term to create the inverted index.
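The indexing scheme described in this caption can be sketched as follows. The code is a simplified, in-memory approximation rather than Ivory's actual implementation: whitespace tokenization, the `(doc_id, term_frequency)` posting layout, and the names `index_mapper`, `index_reducer`, and `build_index` are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def index_mapper(doc_id, text):
    """Emit one posting per distinct term, keyed by the term."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def index_reducer(term, postings):
    """Gather all postings for one term into a list sorted by document id."""
    return sorted(postings)


def build_index(docs):
    """Run the mapper over all documents, group by term, and reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in index_mapper(doc_id, text):
            grouped[term].append(posting)
    return {term: index_reducer(term, postings) for term, postings in grouped.items()}


# Toy documents, invented for illustration.
docs = {1: "raf gene mutation", 2: "raf raf kinase"}
index = build_index(docs)
print(index["raf"])  # [(1, 1), (2, 2)]
```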
Figure 3. Pseudo-code of Ivory's retrieval algorithm in MapReduce. The mapper processes the postings lists in parallel. For each query term, the mapper initializes accumulators to hold partial score contributions from all documents containing the term. The reducer adds up partial scores to produce the final results.
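A rough single-process sketch of this retrieval scheme is given below. It is not Ivory's code: the scoring function is a plain tf × idf stand-in rather than bm25 or the Lucene ranking algorithm, and the function names, the toy index, and the use of 201 as a query identifier are assumptions made for the example.

```python
import math
from collections import defaultdict


def retrieval_mapper(term, postings, queries, num_docs):
    """For each query containing `term`, emit accumulators of partial scores."""
    idf = math.log(num_docs / len(postings))  # stand-in weight, not bm25
    for query_id, query_terms in queries.items():
        if term in query_terms:
            accumulators = {doc_id: tf * idf for doc_id, tf in postings}
            yield query_id, accumulators


def retrieval_reducer(query_id, partials):
    """Add up partial scores per document and rank by total score."""
    totals = defaultdict(float)
    for accumulators in partials:
        for doc_id, score in accumulators.items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def search(index, queries, num_docs):
    """Map over postings lists, group accumulators by query, then reduce."""
    grouped = defaultdict(list)
    for term, postings in index.items():
        for query_id, accumulators in retrieval_mapper(term, postings, queries, num_docs):
            grouped[query_id].append(accumulators)
    return {qid: retrieval_reducer(qid, parts) for qid, parts in grouped.items()}


# Toy inverted index: term -> [(doc_id, term_frequency), ...]
index = {"raf": [(1, 1), (2, 2)], "gene": [(1, 1)], "kinase": [(2, 1), (3, 1)]}
queries = {201: {"raf", "gene"}}
ranked = search(index, queries, num_docs=3)
print([doc_id for doc_id, _ in ranked[201]])  # [1, 2]
```

Each postings list is touched once regardless of how many queries are run, which is why the mapper loops over queries inside the per-term call.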