Yanshan Wang1, Andrew Wen1, Sijia Liu1, William Hersh2, Steven Bedrick3, Hongfang Liu1.
Abstract
OBJECTIVES: To create test collections for evaluating clinical information retrieval (IR) systems and advancing clinical IR research.
Keywords: electronic health records; evaluation; information retrieval; relevance judgment; test collections
Year: 2019 PMID: 31709390 PMCID: PMC6824517 DOI: 10.1093/jamiaopen/ooz016
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. A general IR framework used for retrieval of clinical documents. IR, information retrieval.
Figure 2. Test collections in the evaluation of a practical IR framework. IR, information retrieval.
Figure 3. Hierarchical index structure in Elasticsearch.
Information Retrieval models for the generation of document pool
| Retrieval models | Description |
|---|---|
| tf-idf based VSM | Vector Space Model (VSM) |
| BM25 | Okapi BM25 |
| Dirichlet LM | Language models |
| MRF | Markov Random field (MRF) model |
| CREATE | CREATE |
Abbreviations: IR, information retrieval; LM, language model; MRF, Markov random field; VSM, vector space model.
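As a rough illustration of one of the models listed above, Okapi BM25 scoring over tokenized documents can be sketched as follows. This is a minimal sketch, not the configuration used in the study: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, and the smoothed idf variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`.

    `query` and each document are lists of tokens. The k1 and b values
    are common defaults, assumed here for illustration only.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed idf variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["autism", "patient"],
    ["cerebral", "palsy"],
    ["autism", "autism", "spectrum"],
]
scores = bm25_scores(["autism"], docs)
```

In this toy example the second document does not contain the query term and scores zero, while the third document, which repeats the term, scores higher than the first.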
Examples of three levels of relevance
| Document | Judgment | Reason |
|---|---|---|
| … The patient with autism and cerebral palsy was treated today … | Nonrelevant | The patient has cerebral palsy, which is an exclusion criterion. |
| … He appears to have autism … | Partially relevant | The result could be relevant because it mentions autism, but it does not mention any of the exclusion conditions. |
| … Patient has autism and doesn’t have any of neurodevelopmental disorders …. | Relevant | Content meets all criteria. |
Figure 4. Agreement between two expert judges for each topic.
Figure 5. Time two judges spent per document for each topic.
Figure 6. Results of three levels of relevance for each topic.
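The agreement measure behind this figure is not stated in this record. As one common choice for two judges, Cohen's kappa could be computed per topic along the following lines. This is a hedged sketch: the function name `cohens_kappa` and the label strings are hypothetical, and kappa itself is an assumption rather than the study's confirmed method.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same documents."""
    n = len(labels_a)
    # Observed agreement: fraction of documents given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each judge's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


judge_1 = ["relevant", "relevant", "nonrelevant", "nonrelevant"]
judge_2 = ["relevant", "nonrelevant", "nonrelevant", "nonrelevant"]
kappa = cohens_kappa(judge_1, judge_2)
```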
Performance of IR systems in terms of MAP, Rprec, P@10, NDCG, and infAP
| IR model | MAP | Rprec | P@10 | NDCG | infAP |
|---|---|---|---|---|---|
| tf-idf-based VSM | | | | | |
| BM25 | 0.3091 | 0.3524 | 0.6239 | 0.5622 | 0.3091 |
| Dirichlet LM | 0.2027 | 0.2577 | 0.6370 | 0.4556 | 0.2027 |
| MRF | 0.2060 | 0.2576 | 0.4783 | 0.4088 | 0.2060 |
| CREATE | 0.2343 | 0.2852 | 0.6065 | 0.4316 | 0.2343 |
Abbreviations: IR, information retrieval; LM, language model; MRF, Markov random field; VSM, vector space model.
A bold value indicates the best performance for that metric.
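For readers unfamiliar with the rank-based metrics in this table, the sketch below shows how P@k, R-precision, average precision, and MAP are conventionally defined for binary relevance. It is an illustrative sketch only: the function names and the toy document IDs are hypothetical, and it does not reproduce the evaluation tooling used in the study (NDCG and infAP are omitted).

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k


def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    return sum(1 for d in ranked_ids[:r] if d in relevant_ids) / r


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank holding a relevant document."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(runs):
    """MAP over topics; `runs` is a list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example with hypothetical document IDs.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
ap = average_precision(ranked, relevant)
p2 = precision_at_k(ranked, relevant, k=2)
rp = r_precision(ranked, relevant)
```

Here the relevant documents sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.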
Performance of IR systems per topic in terms of MAP
| Topic ID | tf-idf based VSM | BM25 | Dirichlet LM | MRF | CREATE |
|---|---|---|---|---|---|
| 1 | 0.2996 | 0.0688 | | 0.2359 | 0.0003 |
| 2 | | 0.2936 | 0.2996 | 0.0908 | 0.2612 |
| 3 | | 0.3126 | 0.1719 | 0.0987 | 0.0892 |
| 4 | 0.5801 | 0.5670 | 0.2059 | | 0.5902 |
| 6 | | 0.3093 | 0.2616 | 0.0066 | 0.0946 |
| 7 | | 0.3286 | 0.3492 | 0.2544 | 0.3499 |
| 8 | 0.2770 | 0.2169 | | 0.0040 | 0.2326 |
| 9 | | 0.3134 | 0.3561 | 0.1827 | 0.1681 |
| 10 | | 0.0830 | 0.1089 | 0.1036 | 0.0791 |
| 11 | 0.0230 | 0.0642 | | 0.0202 | 0.0030 |
| 13* | 0.1402 | 0.2031 | 0.0226 | 0.1570 | |
| 14 | 0.3229 | 0.2543 | 0.1353 | 0.2659 | |
| 15 | | 0.3199 | 0.2323 | 0.1938 | 0.3200 |
| 16 | 0.4026 | | 0.4300 | 0.2818 | 0.2969 |
| 17 | | 0.4365 | 0.2042 | 0.2407 | 0.1776 |
| 18 | 0.3832 | | 0.1164 | 0.4491 | 0.2366 |
| 20 | 0.2403 | 0.1316 | | 0.2021 | 0.0274 |
| 23 | | 0.4116 | 0.2499 | 0.2752 | 0.3718 |
| 24* | 0.0752 | 0.1731 | 0.0792 | | 0.1895 |
| 25 | | 0.7149 | 0.6093 | 0.0295 | 0 |
| 26 | | 0.4222 | 0.4102 | 0.3195 | 0.0484 |
| 27 | | 0.2377 | 0.0799 | 0.0039 | 0.0827 |
| 29 | | 0.1725 | 0.1343 | 0.1224 | 0.0534 |
| 30 | 0.3351 | | 0.1647 | 0.0648 | 0.0387 |
| 31 | | 0.7392 | 0.1099 | 0.6502 | 0.6503 |
| 32 | | 0.5397 | 0.1635 | 0 | 0.4634 |
| 33 | | 0.6960 | 0.5149 | 0.4315 | 0.6180 |
| 34 | 0.2579 | 0.2254 | 0.1957 | | 0.0675 |
| 35 | | 0.0690 | 0.0625 | 0.0771 | 0.0095 |
| 36* | 0.0144 | 0.0226 | 0.0584 | 0.0226 | |
| 37 | 0 | 0 | 0 | 0 | 0 |
| 39 | 0.4873 | 0.5137 | 0.4354 | | 0.2009 |
| 40 | | 0.2128 | 0.1686 | 0 | 0.1181 |
| 41 | | 0.5269 | 0.1582 | 0.4686 | 0.3117 |
| 42 | | 0.4290 | 0.2942 | 0.4482 | 0.1368 |
| 43* | 0.0628 | 0.1250 | 0.0026 | 0.0483 | |
| 44* | 0.2251 | 0.1085 | 0.1394 | 0.0105 | |
| 45 | | 0.3931 | 0.2616 | 0.2719 | 0.1486 |
| 47 | 0.4217 | | 0.1989 | 0.1265 | 0.2530 |
| 48* | 0.0144 | 0.0163 | 0.0051 | 0.0833 | |
| 49 | | 0.3130 | 0.2970 | 0.2467 | 0.1911 |
| 50 | 0.3704 | | 0.2421 | 0.3767 | |
| 51* | 0.2048 | 0.1182 | 0.0916 | 0.0017 | |
| 52 | 0.4720 | | 0.0359 | 0.5447 | 0.2462 |
| 53* | 0.0216 | 0.0042 | 0.0197 | 0.0062 | |
| 55 | 0.4435 | | 0.1463 | 0.4316 | 0.1766 |
Note: Topics for which CREATE significantly outperforms the tf-idf-based VSM (t-test, P < .01) are marked with an asterisk (*).
Abbreviations: IR, information retrieval; LM, language model; MRF, Markov random field; VSM, vector space model.
A bold value indicates the best performance for that topic.