Abstract
BACKGROUND: The evaluation of information retrieval techniques has traditionally relied on human judges to determine which documents are relevant to a query and which are not. This protocol is used in the Text REtrieval Conference (TREC), organized annually for the past 15 years, to support the unbiased evaluation of novel information retrieval approaches. The TREC Genomics Track has recently been introduced to measure the performance of information retrieval for biomedical applications.
Year: 2008 PMID: 18312673 PMCID: PMC2292696 DOI: 10.1186/1471-2105-9-132
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the NT Evaluation protocols. The three panels introduce notation used throughout the manuscript. Each protocol randomly samples documents from the text collection to produce information requests/topics. See text for a description of each protocol.
Mean Reciprocal Rank for 1,000 queries (focused search evaluation) measured for 13 search methods.
| 2.028 | 0.493 | BM25ec | Porter | 0 |
| 2.054 | 0.487 | BM25ec | Paice-Husk | 0 |
| 1.787 | 0.560 | BM25ec | None | 0 |
| 1.750 | 0.571 | BM25ec | None | 20 |
| 1.724 | 0.580 | BM25ec | None | 40 |
| 1.732 | 0.577 | BM25ec | None | 60 |
| 1.732 | 0.577 | BM25ec | None | 80 |
| 1.728 | 0.579 | BM25ec | None | 100 |
| 1.737 | 0.576 | BM25ec | None | 120 |
| 1.752 | 0.571 | BM25ec | None | 140 |
| 1.755 | 0.570 | BM25ec | None | 160 |
| 1.760 | 0.568 | BM25ec | None | 180 |
| 1.767 | 0.566 | BM25ec | None | 200 |
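As a reading aid for the table above, the following is a minimal sketch of how a Mean Reciprocal Rank can be computed. It assumes that each query has exactly one target document and that a query whose target is not retrieved contributes zero; the function name and the input format are illustrative and do not come from the source.

```python
def mean_reciprocal_rank(target_ranks):
    """Return the Mean Reciprocal Rank over a set of queries.

    target_ranks: one entry per query, holding the 1-based rank at which
    the target document was retrieved, or None if it was not retrieved.
    Assumption (not from the source): a missing target contributes 0.
    """
    if not target_ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in target_ranks:
        if rank is None:
            continue  # target not retrieved: reciprocal rank taken as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(target_ranks)


if __name__ == "__main__":
    # Three hypothetical queries whose targets were found at ranks 1, 2 and 4.
    print(mean_reciprocal_rank([1, 2, 4]))
```

Under this definition, a value of about 0.5 corresponds to a target document that is typically retrieved near the second position of the result list.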
Figure 2. MAP and bpref performance measures obtained by NT Evaluation and TREC evaluation. The scatter plots compare the performance of methods measured in the NT Evaluation protocol and with TREC relevance judgments (left four plots), or compare agreement between two independent TREC Genomics Track evaluations (rightmost plots). Pearson correlation coefficients are shown in each scatter plot (values in parentheses are Spearman rank correlation coefficients). Better correlations are observed when bpref measures are compared (top row of scatter plots) than when MAP measures are compared (bottom row).
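The two measures named in this caption can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used in the study: average precision is computed per query (MAP is its mean over queries), and bpref follows a commonly used formulation in which unjudged documents are ignored and the penalty for each relevant document is capped. The function names and the toy document identifiers are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision for one query; MAP is the mean over all queries."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position  # precision at this relevant document
    return total / len(relevant)


def bpref(ranking, relevant, nonrelevant):
    """bpref for one query, using judged documents only.

    Assumed formulation: each retrieved relevant document scores
    1 - min(n, R) / min(R, N), where n is the number of judged
    nonrelevant documents ranked above it, R the number of judged
    relevant and N the number of judged nonrelevant documents.
    """
    relevant, nonrelevant = set(relevant), set(nonrelevant)
    num_rel, num_nonrel = len(relevant), len(nonrelevant)
    if num_rel == 0:
        return 0.0
    denom = min(num_rel, num_nonrel)
    nonrel_seen = 0
    total = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if nonrel_seen == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrel_seen, num_rel) / denom
        # documents without a judgment are skipped entirely
    return total / num_rel


if __name__ == "__main__":
    run = ["d1", "d2", "d3", "d4", "d5"]  # d5 has no judgment
    print(average_precision(run, {"d1", "d3"}))
    print(bpref(run, {"d1", "d3"}, {"d2", "d4"}))
```

Because bpref ignores unjudged documents while average precision treats them as nonrelevant, the two measures can react differently when relevance judgments are incomplete.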
Correlation coefficients for data in Figure 1. Pearson's coefficients are shown, followed by Spearman's rank coefficients in parentheses.
| 1.0000 | 0.9291 (0.9780) | 0.8416 (0.8057) | ||
| 1.0000 | 0.9560 (0.6575) | | ||
| 1.0000 | 0.9373 (0.8307) | |||
| 1.0000 |
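The two kinds of coefficient reported above differ in what they measure: Pearson's coefficient captures linear agreement between two sets of scores, whereas Spearman's coefficient captures agreement between the rankings those scores induce. A minimal standard-library sketch is given below; the function names are illustrative, and ties are handled by average ranks, which is an assumption since the source does not say how ties were treated.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical scores of four methods under two evaluation protocols.
    protocol_a = [1.0, 2.0, 3.0, 4.0]
    protocol_b = [1.0, 4.0, 9.0, 16.0]
    print(pearson(protocol_a, protocol_b), spearman(protocol_a, protocol_b))
```

In the toy data both protocols order the methods identically, so the rank coefficient is maximal while the linear coefficient is slightly lower; the reverse pattern, as in some rows of the table, arises when scores are nearly collinear but several methods swap places.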
Search methods S compared with the high-recall NT Evaluation protocol.
| 0 | 3 | INTER_MATCH_DISTANCE_SCORER | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 4 | INTER_MATCH_DISTANCE_SCORER | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 0 | 5 | bm25ec | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 6 | bm25ec | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 0 | 7 | BM25EC2_IMD_SCORER | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 8 | BM25EC2_IMD_SCORER | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 0 | 9 | bm25ec | ConjunctiveDisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 10 | INTER_MATCH_DISTANCE_SCORER(1,-1) | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 11 | INTER_MATCH_DISTANCE_SCORER(-1,1) | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 14 | INTER_MATCH_DISTANCE_SCORER(-3,1) | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 0 | 15 | INTER_MATCH_DISTANCE_SCORER(-2,1) | DisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 20 | 20 | BM25EC2_IMD_SCORER | ConjunctiveDisjunctiveQueryDistributor | 8 | no | N/A | N/A |
| 80 | 21 | BM25EC2_IMD_SCORER | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 160 | 22 | BM25EC2_IMD_SCORER | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 200 | 23 | BM25EC2_IMD_SCORER | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 20 | 40 | bm25ec | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 80 | 41 | bm25ec | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 160 | 42 | bm25ec | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 200 | 43 | bm25ec | ConjunctiveDisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 20 | 50 | bm25ec | DisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 80 | 51 | bm25ec | DisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 160 | 52 | bm25ec | DisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 200 | 53 | bm25ec | DisjunctiveQueryDistributor | 16 | no | N/A | N/A |
| 20 | 60 | bm25ec | DisjunctiveQueryDistributor | 16 | yes | 15 | 15 |
| 80 | 61 | bm25ec | DisjunctiveQueryDistributor | 16 | yes | 10 | 15 |
| 160 | 62 | bm25ec | DisjunctiveQueryDistributor | 16 | yes | 5 | 15 |
| 200 | 63 | bm25ec | DisjunctiveQueryDistributor | 16 | yes | 15 | 20 |
Figure 3. Sensitivity of the evaluation to the search method S. Two different search methods were used in Step 1 of the high-recall evaluation (n = 29 search methods tested). The panels show MAP and bpref agreement between these two runs. Agreement is stronger for bpref than for MAP (MAP/MAP correlation coefficient: 0.9540, bpref/bpref: 0.9740). These results indicate that the performance measures produced by the high-recall evaluation protocol depend only marginally on the choice of the method S used to perform Step 1.
Figure 4. Evaluations with a different sample of search methods. Search methods different from those used in Figure 1 were evaluated with NT Evaluation and with TREC Genomics Track 2004 and 2005 relevance judgments. Pearson correlation coefficients are shown in each scatter plot (values in parentheses are Spearman rank correlation coefficients). As for the sample used in Figure 1, better correlations are observed when bpref measures are compared (top row of scatter plots) than when MAP measures are compared (bottom row).
Figure 5. NT Evaluation predicts favorable regions of the search parameter space. Each contour plot shows how retrieval performance changes with the value of parameters k1 and b of the Okapi BM25 search method. The top-left plot is constructed for focused searches. The two plots on the right are constructed with TREC Genomics Track relevance judgments. The plot on the bottom left is constructed with the high-recall NT Evaluation protocol. High-recall NT Evaluation and TREC Genomics evaluations show similar performance contours with respect to parameters, suggesting that NT Evaluation can be used to select reasonable search engine parameters without human relevance judgments.
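To make the roles of k1 and b concrete, the sketch below scores documents with the textbook form of Okapi BM25. It is not the BM25ec implementation evaluated in the study, whose details are not given here; the function name, the tokenized toy documents, the default parameter values and the non-negative IDF variant are all assumptions made for illustration. In this form, k1 controls how quickly repeated occurrences of a term saturate, and b controls how strongly scores are normalized by document length.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25.

    docs: list of token lists. k1 and b are the two parameters varied in
    the contour plots; the defaults are common choices, not values taken
    from the source.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))  # count each term once per document
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: b = 0 disables it, b = 1 applies it fully.
        length_ratio = len(d) / avg_len if avg_len > 0 else 0.0
        norm = k1 * (1.0 - b + b * length_ratio)
        score = 0.0
        for term in query_terms:
            freq = tf[term]
            if freq == 0:
                continue
            # Non-negative IDF variant (an assumption of this sketch).
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * freq * (k1 + 1.0) / (freq + norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        ["gene", "expression", "gene"],
        ["protein", "binding"],
        ["gene", "protein", "interaction", "network"],
    ]
    # A coarse sweep over (k1, b), in the spirit of the contour plots.
    for k1 in (0.5, 1.2, 2.0):
        for b in (0.0, 0.75):
            print(k1, b, bm25_scores(["gene"], docs, k1=k1, b=b))
```

A parameter sweep such as the one in the usage example, with each (k1, b) pair scored by an evaluation measure rather than printed, is one way to obtain the grid of values from which contour plots of this kind can be drawn.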