Peng Dong, Marie Loh, Adrian Mondry.
Abstract
BACKGROUND: Relevance assessment is a major problem in the evaluation of information retrieval systems. The work presented here introduces a new parameter, "Relevance Similarity", for measuring the variation in relevance assessment. Where individual assessments can be compared with a gold standard, this parameter is used to study the effect of such variation on the performance of a medical information retrieval system. In such a setting, Relevance Similarity is the ratio of the number of assessors who rank a given document the same as the gold standard to the total number of assessors in the group.
Year: 2005 PMID: 16029513 PMCID: PMC1181804 DOI: 10.1186/1742-5581-2-6
Source DB: PubMed Journal: Biomed Digit Libr ISSN: 1742-5581
Figure 1. Workflow for analysing the effect of inter-evaluator variation on the CAT Crawler information retrieval system.
CAT Link retrieval details. The numbers indicate how many documents the CAT Crawler meta-search engine retrieved for each keyword.
| Keyword | Documents retrieved |
| Appendicitis | 8 |
| Colic | 9 |
| Intubation | 22 |
| Ketoacidosis | 2 |
| Octreotide | 3 |
| Palsy | 10 |
| Prophylaxis | 30 |
| Sleep | 16 |
| Tape | 3 |
| Ultrasound | 29 |
| Total | 132 |
Relevance Similarity for the 132 retrieved CAT links. For each of the 132 documents retrieved by the CAT Crawler meta-search engine, Relevance Similarity (in %) was calculated for both Group A and Group B. The Link S/N column gives the serial number of each document.
| Link S/N | Group A | Group B | Link S/N | Group A | Group B | Link S/N | Group A | Group B |
| 1 | 100 | 83.33 | 45 | 100 | 100 | 89 | 66.67 | 33.33 |
| 2 | 83.33 | 66.67 | 46 | 50 | 50 | 90 | 50 | 83.33 |
| 3 | 100 | 100 | 47 | 50 | 33.33 | 91 | 50 | 66.67 |
| 4 | 100 | 100 | 48 | 50 | 66.67 | 92 | 66.67 | 83.33 |
| 5 | 100 | 100 | 49 | 100 | 100 | 93 | 33.33 | 83.33 |
| 6 | 100 | 100 | 50 | 100 | 100 | 94 | 50 | 50 |
| 7 | 100 | 100 | 51 | 100 | 100 | 95 | 33.33 | 66.67 |
| 8 | 100 | 100 | 52 | 66.67 | 50 | 96 | 100 | 100 |
| 9 | 100 | 100 | 53 | 33.33 | 16.67 | 97 | 100 | 100 |
| 10 | 100 | 100 | 54 | 100 | 100 | 98 | 100 | 100 |
| 11 | 0 | 0 | 55 | 100 | 100 | 99 | 100 | 66.67 |
| 12 | 100 | 100 | 56 | 100 | 100 | 100 | 66.67 | 66.67 |
| 13 | 100 | 100 | 57 | 100 | 100 | 101 | 100 | 100 |
| 14 | 100 | 100 | 58 | 100 | 100 | 102 | 66.67 | 83.33 |
| 15 | 100 | 100 | 59 | 100 | 100 | 103 | 100 | 100 |
| 16 | 100 | 100 | 60 | 100 | 100 | 104 | 83.33 | 100 |
| 17 | 83.33 | 83.33 | 61 | 66.67 | 33.33 | 105 | 100 | 100 |
| 18 | 100 | 100 | 62 | 83.33 | 33.33 | 106 | 100 | 100 |
| 19 | 66.67 | 83.33 | 63 | 16.67 | 83.33 | 107 | 100 | 100 |
| 20 | 66.67 | 50 | 64 | 50 | 83.33 | 108 | 100 | 100 |
| 21 | 50 | 66.67 | 65 | 100 | 100 | 109 | 100 | 100 |
| 22 | 33.33 | 66.67 | 66 | 66.67 | 83.33 | 110 | 83.33 | 66.67 |
| 23 | 100 | 100 | 67 | 0 | 33.33 | 111 | 83.33 | 83.33 |
| 24 | 50 | 50 | 68 | 50 | 50 | 112 | 83.33 | 100 |
| 25 | 50 | 66.67 | 69 | 0 | 16.67 | 113 | 83.33 | 33.33 |
| 26 | 83.33 | 50 | 70 | 66.67 | 50 | 114 | 100 | 100 |
| 27 | 66.67 | 100 | 71 | 83.33 | 66.67 | 115 | 100 | 100 |
| 28 | 50 | 66.67 | 72 | 100 | 83.33 | 116 | 100 | 100 |
| 29 | 100 | 100 | 73 | 50 | 83.33 | 117 | 83.33 | 66.67 |
| 30 | 50 | 50 | 74 | 100 | 66.67 | 118 | 83.33 | 66.67 |
| 31 | 100 | 100 | 75 | 100 | 83.33 | 119 | 83.33 | 66.67 |
| 32 | 83.33 | 83.33 | 76 | 100 | 100 | 120 | 83.33 | 66.67 |
| 33 | 100 | 66.67 | 77 | 100 | 83.33 | 121 | 100 | 66.67 |
| 34 | 100 | 83.33 | 78 | 66.67 | 50 | 122 | 100 | 83.33 |
| 35 | 100 | 100 | 79 | 83.33 | 83.33 | 123 | 100 | 83.33 |
| 36 | 100 | 100 | 80 | 100 | 100 | 124 | 100 | 100 |
| 37 | 33.33 | 16.67 | 81 | 83.33 | 66.67 | 125 | 100 | 100 |
| 38 | 83.33 | 66.67 | 82 | 100 | 66.67 | 126 | 66.67 | 66.67 |
| 39 | 66.67 | 50 | 83 | 66.67 | 33.33 | 127 | 100 | 100 |
| 40 | 50 | 50 | 84 | 83.33 | 66.67 | 128 | 83.33 | 33.33 |
| 41 | 100 | 100 | 85 | 83.33 | 100 | 129 | 33.33 | 50 |
| 42 | 100 | 100 | 86 | 100 | 100 | 130 | 66.67 | 83.33 |
| 43 | 16.67 | 50 | 87 | 83.33 | 100 | 131 | 83.33 | 66.67 |
| 44 | 100 | 100 | 88 | 33.33 | 50 | 132 | 83.33 | 100 |
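The definition of Relevance Similarity given in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `relevance_similarity`, the boolean encoding of judgements (True = relevant) and the example votes are all assumptions. The multiples of 16.67% in the table are consistent with six assessors per group.

```python
from typing import Sequence


def relevance_similarity(assessments: Sequence[bool], gold: bool) -> float:
    """Percentage of assessors whose judgement matches the gold standard."""
    if not assessments:
        raise ValueError("at least one assessment is required")
    # Count assessors who gave the same judgement as the gold standard.
    matches = sum(1 for vote in assessments if vote == gold)
    return 100.0 * matches / len(assessments)


# Hypothetical judgements from six assessors for one document (True = relevant).
group_votes = [True, True, False, True, True, True]
print(round(relevance_similarity(group_votes, gold=True), 2))  # 83.33
```

With six assessors the only possible values are 0, 16.67, 33.33, 50, 66.67, 83.33 and 100, which are exactly the levels that occur in the table.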
Figure 2. Frequency analysis of evaluation similarity of Group A and Group B versus the gold standard for all 132 CATs. Blue bars show the number of CATs that Group A evaluated at each level of similarity to the gold standard; red bars show the same for Group B.
Average recall (in %) for the gold standard and the two groups of evaluators
| Keyword | Gold standard | Group A | Group B |
| Appendicitis | 100.00 | 97.92 | 93.75 |
| Colic | 53.33 | 58.89 | 58.89 |
| Intubation | 37.84 | 41.44 | 40.09 |
| Ketoacidosis | 33.33 | 50.00 | 50.00 |
| Octreotide | 75.00 | 54.17 | 62.50 |
| Palsy | 54.55 | 65.15 | 65.15 |
| Prophylaxis | 64.86 | 69.82 | 56.76 |
| Sleep | 43.75 | 59.38 | 51.04 |
| Tape | 50.00 | 44.44 | 47.22 |
| Ultrasound | 36.17 | 38.30 | 39.36 |
| Average | 54.88 | 57.95 | 56.48 |
Average precision (in %) for the gold standard and the two groups of evaluators
| Keyword | Gold standard | Group A | Group B |
| Appendicitis | 100.00 | 97.92 | 93.75 |
| Colic | 88.89 | 98.15 | 98.15 |
| Intubation | 63.64 | 69.70 | 67.42 |
| Ketoacidosis | 50.00 | 75.00 | 75.00 |
| Octreotide | 100.00 | 72.22 | 83.33 |
| Palsy | 60.00 | 71.67 | 71.67 |
| Prophylaxis | 80.00 | 86.11 | 70.00 |
| Sleep | 43.75 | 59.38 | 51.04 |
| Tape | 100.00 | 88.89 | 94.44 |
| Ultrasound | 58.62 | 62.07 | 63.79 |
| Average | 74.49 | 78.11 | 76.86 |
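For readers who want to relate the percentages above to document counts, the following sketch computes precision and recall with the standard set-based definitions. The function name `precision_recall` and the document identifiers are hypothetical. The counts (9 retrieved, 8 of them relevant, 15 relevant overall) are an assumption chosen only because they reproduce the 88.89% precision and 53.33% recall values that appear in the tables; the actual counts are not stated in this section.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Return (precision, recall) in percent for one query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = 100.0 * hits / len(retrieved) if retrieved else 0.0
    recall = 100.0 * hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed counts: 9 documents retrieved, 15 relevant overall, 8 in both sets.
retrieved_docs = set(range(9))     # hypothetical document ids 0..8
relevant_docs = set(range(1, 16))  # hypothetical document ids 1..15
p, r = precision_recall(retrieved_docs, relevant_docs)
print(round(p, 2), round(r, 2))  # 88.89 53.33
```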
Figure 3. Recall comparison. The bars indicate each of the three groups' recall (in %) for the ten keywords.
Figure 4. Precision comparison. The bars indicate each of the three groups' precision (in %) for the ten keywords.
Kappa scores within Group A and Group B, demonstrating the paradoxically low kappa scores despite high agreement.
| | Group A | | | | | Group B | | | | |
| Evaluator | 2 | 3 | 4 | 5 | 6 | 2 | 3 | 4 | 5 | 6 |
| 1 | 0.404 | 0.426 | 0.136 | 0.258 | 0.656 | 0.208 | 0.670 | 0.410 | 0.807 | 0.352 |
| 2 | | 0.461 | 0.259 | 0.713 | 0.520 | | 0.257 | 0.135 | 0.125 | -0.001 |
| 3 | | | 0.180 | 0.438 | 0.439 | | | 0.440 | 0.643 | 0.353 |
| 4 | | | | 0.241 | 0.270 | | | | 0.370 | 0.250 |
| 5 | | | | | 0.404 | | | | | 0.330 |
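The paradox named in the caption, low kappa despite high raw agreement, can be reproduced with a small sketch. It assumes the pairwise scores are Cohen's kappa, which this section does not state explicitly. The function name `cohen_kappa` and the two rating vectors are hypothetical and are not data from the study.

```python
from typing import Sequence


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters making binary judgements on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal rate of "relevant".
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 100 documents: each rater marks 95 as relevant,
# they agree on 90 documents and disagree on 10.
rater_a = [True] * 95 + [False] * 5
rater_b = [True] * 90 + [False] * 5 + [True] * 5
print(round(cohen_kappa(rater_a, rater_b), 3))  # -0.053
```

In this example the raters agree on 90% of the documents, yet kappa is slightly negative. Because both raters call almost every document relevant, the agreement expected by chance (90.5%) is already above the observed agreement.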