Abstract
The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, leading to both corpora for COVID-19 literature and search engines to query such data. While most search engine research is performed in academia with rigorous evaluation, major commercial companies dominate the web search market. Thus, it is expected that commercial pandemic-specific search engines will gain much higher traction than academic alternatives, leading to questions about the empirical performance of these tools. This paper seeks to empirically evaluate two commercial search engines for COVID-19 (Google and Amazon) in comparison with academic prototypes evaluated in the TREC-COVID task. We performed several steps to reduce bias in the manual judgments to ensure a fair comparison of all systems. We find the commercial search engines sizably underperformed those evaluated under TREC-COVID. This has implications for trust in popular health search engines and developing biomedical search engines for future health crises.
Keywords: COVID-19; TREC-COVID; coronavirus; information retrieval
Year: 2021 PMID: 33197268 PMCID: PMC7717324 DOI: 10.1093/jamia/ocaa271
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Four example topics from Round 1 of the TREC-COVID challenge. Each topic is assigned a category (for this paper, not by TREC-COVID) based on both its research field and its function, which allows us to characterize system performance on particular kinds of topics.
| Topic 10 | Topic 13 | Topic 22 | Topic 30 |
Figure 1. A bar chart showing the number of documents per topic used in our evaluations (after filtering documents against the April 10 release of the CORD-19 dataset and thresholding each topic at the minimum number of documents retrieved for that topic). The total numbers of documents additionally annotated for relevance and for error analysis are shown as circle and cross marks on each topic's bar. Note that these additional annotations are at the topic level and can therefore exceed the per-system document counts shown by the bars.
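The per-topic thresholding described in the Figure 1 caption can be expressed compactly. The Python sketch below is only an illustration under an assumed data layout (a dict mapping (system, topic) pairs to ranked lists of CORD-19 document IDs already filtered to the April 10 release); it is not the authors' actual code.

```python
from collections import defaultdict

def threshold_per_topic(results):
    """Truncate every system's ranking for a topic to the minimum
    list length any system retrieved for that topic."""
    # Find, for each topic, the shortest result list across systems.
    min_len = defaultdict(lambda: float("inf"))
    for (system, topic), docs in results.items():
        min_len[topic] = min(min_len[topic], len(docs))
    # Truncate all rankings to that per-topic minimum, so each system
    # is evaluated over the same number of documents per topic.
    return {
        (system, topic): docs[: min_len[topic]]
        for (system, topic), docs in results.items()
    }
```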
Evaluation results after thresholding the number of documents per topic at the minimum number of documents present for that topic across systems. The relevance judgments combine Rounds 1 and 2 of TREC-COVID with our additional relevance assessments. The highest scores for the evaluated and TREC-COVID systems are underlined.
| System | Query/Run | P@5 | P@10 | NDCG@10 | MAP | NDCG | bpref |
|---|---|---|---|---|---|---|---|
| Amazon | question | 0.6733 | 0.6333 | 0.5390 | 0.0722 | 0.1838 | 0.1049 |
| Amazon | question + narrative | | | | | | 0.1063 |
| Google | question | 0.5733 | 0.5700 | 0.4972 | 0.0693 | 0.1831 | |
| Google | question + narrative | 0.6067 | 0.5600 | 0.5112 | 0.0687 | 0.1821 | 0.1054 |
| TREC-COVID | 1. sab20.1.meta.docs | | | | | | |
| TREC-COVID | 2. sab20.1.merged | 0.6733 | 0.6433 | 0.5555 | 0.0787 | 0.1971 | 0.1154 |
| TREC-COVID | 3. UIowaS_Run3 | 0.6467 | 0.6367 | 0.5466 | 0.0952 | 0.2091 | 0.1279 |
| TREC-COVID | 4. smith.rm3 | 0.6467 | 0.6133 | 0.5225 | 0.0914 | 0.2095 | 0.1303 |
| TREC-COVID | 5. udel_fang_run3 | 0.6333 | 0.6133 | 0.5398 | 0.0857 | 0.1977 | 0.1187 |
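P@k, NDCG, MAP, and bpref in the table above are standard TREC measures, conventionally computed with the trec_eval tool over run files and relevance judgments (qrels). As a self-contained illustration of two of them (not the paper's evaluation pipeline), the Python sketch below computes P@k and NDCG@k from a ranked list and graded judgments; the document IDs and grades in the usage example are hypothetical.

```python
import math

def precision_at_k(ranking, qrels, k):
    """Fraction of the top-k documents judged relevant (grade > 0)."""
    top = ranking[:k]
    return sum(1 for doc in top if qrels.get(doc, 0) > 0) / k

def ndcg_at_k(ranking, qrels, k):
    """Normalized discounted cumulative gain over the top k results."""
    dcg = sum(
        qrels.get(doc, 0) / math.log2(rank + 2)  # rank 0 -> log2(2)
        for rank, doc in enumerate(ranking[:k])
    )
    # Ideal DCG: judged grades sorted from highest to lowest.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical qrels (2 = relevant, 1 = partially relevant, 0 = not
# relevant) and a hypothetical ranking for a single topic:
qrels = {"d1": 2, "d2": 0, "d3": 1, "d4": 2}
ranking = ["d1", "d2", "d3", "d5", "d4"]
print(precision_at_k(ranking, qrels, 5))  # 0.6
print(ndcg_at_k(ranking, qrels, 5))
```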
Figure 2. Analysis of system performance across different categories of topics. Research Field: categories based on the topic's field of study within biomedical informatics. Function: categories based on the functional aspect of COVID-19 expressed in the topic's information need.
Figure 3. Total number of documents retrieved by the systems (among the top 10 documents per topic), broken down by error category. NA to COVID-19: the document is not applicable to COVID-19. Tangential: not relevant at all. Partially Tangential: not relevant, but sharing a common link with the topic (e.g., quarantine). Partially Relevant: answers only part of the topic. Relevant: provides an answer to the topic.
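A tally like the one in Figure 3 can be produced by counting per-document error labels. In the Python sketch below, the category names follow the caption above, while the label data itself is hypothetical.

```python
from collections import Counter

# Hypothetical error-analysis labels for one system's top-10 results
# on a topic, using the categories from the Figure 3 caption.
labels = [
    "Relevant", "Relevant", "Partially Relevant", "Tangential",
    "Partially Tangential", "NA to COVID-19", "Relevant",
]
counts = Counter(labels)
for category in ("NA to COVID-19", "Tangential", "Partially Tangential",
                 "Partially Relevant", "Relevant"):
    print(f"{category}: {counts.get(category, 0)}")
```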