Bahaj Adil, Safae Lhazmir, Mounir Ghogho, Houda Benbrahim.
Abstract
The urgency of the COVID-19 pandemic caused a surge in the related scientific literature. This surge made the manual exploration of scientific articles time-consuming and inefficient. Therefore, a range of exploratory search applications have been created to facilitate access to the available literature. In this survey, we give a short description of certain efforts in this direction and explore the different approaches that they used.
Keywords: COVID-19; document retrieval; exploratory search; machine learning
Year: 2022 PMID: 36009848 PMCID: PMC9404775 DOI: 10.3390/biology11081221
Source DB: PubMed Journal: Biology (Basel) ISSN: 2079-7737
Figure 1. Application Development Phases.
Summary of the Datasets. NER refers to Named Entity Recognition, RE refers to Relationship Extraction, SMZ refers to summarization, QA refers to Question Answering, DR refers to Document Retrieval.
| Dataset | Application Refs | Tasks | Statistics | URL |
|---|---|---|---|---|
| TREC-COVID [ | [ | DR | The TREC-COVID dataset has many versions which correspond to TREC-COVID challenges. For example, round three contains a total of 16,677 unique journal articles in CORD-19 with a relevance annotation. | |
| COVIDQA * [ | [ | QA | The dataset contains 147 question–article–answer triples with 27 unique questions and 104 unique articles. | |
| COVID-19 Questions * [ | [ | QA | The dataset contains 111 question–answer pairs with 53 interrogative and 58 keyword-style queries. | |
| COVID-QA * [ | [ | QA | The dataset consists of 2019 question–article–answer triples. | |
| InfoBot Dataset * [ | [ | QA, FAQ | 2200 COVID-19-related frequently asked question–answer pairs. | |
| MS-MARCO [ | [ | QA | 1,000,000 training instances. | |
| Med-MARCO [ | [ | QA | 79K of the original MS-MARCO questions (9.7%). | |
| Natural Questions [ | [ | QA | The public release consists of 307,373 training examples with single annotations; 7830 examples with 5-way annotations for development data; and a further 7842 examples with 5-way annotations sequestered as test data. | |
| SQuAD [ | [ | QA | The dataset contains 107,785 question–answer pairs on 536 articles. | |
| BioASQ [ | [ | QA, DR | 500 questions with their relevant documents, text span answers and perfect answers. | |
| M-CID [ | [ | QA | The dataset is composed of 6871 natural language utterances across 16 COVID-19-specific intents and 4 languages: English, Spanish, French and German. | |
| QuAC [ | [ | QA | 14K information-seeking QA dialogs, and 100K questions in total. | |
| GENIA [ | [ | NER | 2000 abstracts taken from the MEDLINE database; contains more than 400,000 words and almost 100,000 annotations. | |
| DUC 2005, 2006 [ | [ | SMZ | The dataset is composed of 50 topics. | |
| Debatepedia [ | [ | SMZ | It consists of 10,859 training examples, 1357 testing and 1357 validation samples. The average number of words in summary, documents and query is 11.16, 66.4 and 10, respectively. | |
| JNLPBA [ | [ | NER | This dataset contains a subset of the GENIA dataset V3.02. This subset is composed of 2404 abstracts. The articles were chosen to contain the MeSH terms “human”, “blood cells” and “transcription factors”, and their publication year ranges from 1990 to 1999. | |
| CHEMDNER [ | [ | NER | 10,000 PubMed abstracts that contain a total of 84,355 chemical entities. | |
| NCBI Disease Corpus [ | [ | NER | 793 annotated PubMed abstracts containing a total of 6892 disease mentions, which are mapped to 790 unique disease concepts. | |
| CHEMPROT [ | [ | NER, RE | 2500 PubMed abstracts, from which 32,000 chemical entities and 31,000 protein entities were extracted. In addition, 10,000 chemical-protein relationships were extracted. | |
| BC5CDR [ | [ | NER, RE | 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. | |
| COV19_729 * [ | [ | NER | The dataset is composed of 729 examples. Each example is a triple comprising an entity, the class that the entity belongs to (i.e., disease, protein, chemical), and a physician’s rating of how related that entity is to COVID-19. | |
Examples of Entities Specifications.
| Entities | Properties | Description | ID |
|---|---|---|---|
| Paper | title, publication date, journal, Digital Object Identifier (DOI), link | Representation of research paper entities. | E1 |
| Author | identifier, first names, middle names, last names | Representation of the paper authors. | E2 |
| Affiliation | identifier, name, country, city | Representation of a research structure where an author belongs. | E3 |
| Concept | concept identifier, textual value, concept type (gene, disease, topic, chemical, etc.) | Representation of a domain specific concept. | E4 |
Examples of Relations.
| Source Entity | Dest. Entity | Relation | Description | ID |
|---|---|---|---|---|
| Paper | Paper | cites | This relation connects paper entities with paper references indicating a citation relation. | R1 |
| Author | Author | co-author | This relation connects an author entity with another author entity indicating a co-authorship relation. | R2 |
| Concept | Concept | relate concepts | This relationship links two concepts with any general relationship that might link them. | R3 |
| Paper | Author | authored by | This relation connects paper entities with author entities and indicates an authorship relation. | R4 |
| Paper | Concept | associated concept | This relation connects paper entities with concept entities. | R5 |
| Author | Affiliation | affiliated with | This relation connects author entities with institution entities. | R6 |
| Author | Concept | research area | This relation connects author entities with concept entities indicating a research area of the author. | R7 |
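The entity and relation specifications above can be read as a small typed-graph schema. The following is a minimal sketch of that reading, not the implementation of any surveyed system: the class names `Entity` and `KnowledgeGraph`, the method names, and the identifiers such as `paper-1` are hypothetical and chosen only for illustration. Only the entity types (E1 to E4) and the relation names with their source and destination types (R1 to R7) come from the tables.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Allowed (source type, destination type) for each relation, copied from R1-R7.
RELATION_SCHEMA: Dict[str, Tuple[str, str]] = {
    "cites": ("Paper", "Paper"),                    # R1
    "co-author": ("Author", "Author"),              # R2
    "relate concepts": ("Concept", "Concept"),      # R3
    "authored by": ("Paper", "Author"),             # R4
    "associated concept": ("Paper", "Concept"),     # R5
    "affiliated with": ("Author", "Affiliation"),   # R6
    "research area": ("Author", "Concept"),         # R7
}


@dataclass
class Entity:
    kind: str                  # "Paper", "Author", "Affiliation" or "Concept" (E1-E4)
    identifier: str            # hypothetical unique key used to reference the entity
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class KnowledgeGraph:
    entities: Dict[str, Entity] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.identifier] = entity

    def add_relation(self, source_id: str, relation: str, dest_id: str) -> None:
        # Reject edges whose endpoint types do not match the relation table.
        expected = RELATION_SCHEMA[relation]
        actual = (self.entities[source_id].kind, self.entities[dest_id].kind)
        if actual != expected:
            raise ValueError(f"{relation!r} expects {expected}, got {actual}")
        self.edges.append((source_id, relation, dest_id))

    def neighbors(self, source_id: str, relation: str) -> List[str]:
        # Destination identifiers reachable from source_id via the given relation.
        return [d for s, r, d in self.edges if s == source_id and r == relation]


# Illustrative data only; none of these values come from a real knowledge graph.
kg = KnowledgeGraph()
kg.add_entity(Entity("Paper", "paper-1", {"title": "Example paper"}))
kg.add_entity(Entity("Author", "author-1", {"last names": "Doe"}))
kg.add_entity(Entity("Concept", "concept-1",
                     {"textual value": "COVID-19", "concept type": "disease"}))
kg.add_relation("paper-1", "authored by", "author-1")
kg.add_relation("paper-1", "associated concept", "concept-1")
```

Validating endpoint types at insertion time keeps the graph consistent with the relation table; a production system would more likely delegate this to a graph database or an ontology layer.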
Summary of Knowledge Graphs Related to COVID-19.
| KG | Usage | Ent. | Rel. |
|---|---|---|---|
| CKG [ | Article recommendations, citation-based navigation, and search result ranking. | E1, E2, E3, E4 | R1, R4, R6, R5 |
| CovEx KG [ | Document Retrieval. | E1, E2, E4 | R1, R4, R5, R7 |
| ERLKG [ | Link prediction. | E4 | R3 |
| COVID-KG [ | QA, Semantic Visualization, Drug Re-purposing. | E4 | R3 |
| COFIE KG [ | KG search over relations and entities using a query. | E4 | R3 |
| Network Visualization KG [ | Data Visualization. | E4 | R3 |
| Vapur KG [ | Query extension. | E4 | R3 |
| Citation KG [ | Document Ranking. | E1 | R1 |
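To make the summary above easier to query, the entity and relation coverage of each knowledge graph can be encoded as a lookup table. This is only an illustrative sketch: the dictionary `KG_COVERAGE` and the helper names `graphs_with_relation` and `graphs_with_entity` are hypothetical, while the E and R identifiers per graph are transcribed from the table.

```python
from typing import Dict, List, Set

# Coverage per knowledge graph, transcribed from the summary table (E = entity, R = relation).
KG_COVERAGE: Dict[str, Dict[str, Set[str]]] = {
    "CKG": {"entities": {"E1", "E2", "E3", "E4"}, "relations": {"R1", "R4", "R6", "R5"}},
    "CovEx KG": {"entities": {"E1", "E2", "E4"}, "relations": {"R1", "R4", "R5", "R7"}},
    "ERLKG": {"entities": {"E4"}, "relations": {"R3"}},
    "COVID-KG": {"entities": {"E4"}, "relations": {"R3"}},
    "COFIE KG": {"entities": {"E4"}, "relations": {"R3"}},
    "Network Visualization KG": {"entities": {"E4"}, "relations": {"R3"}},
    "Vapur KG": {"entities": {"E4"}, "relations": {"R3"}},
    "Citation KG": {"entities": {"E1"}, "relations": {"R1"}},
}


def graphs_with_relation(relation_id: str) -> List[str]:
    """Names of the graphs that model the given relation, in table order."""
    return [name for name, cov in KG_COVERAGE.items() if relation_id in cov["relations"]]


def graphs_with_entity(entity_id: str) -> List[str]:
    """Names of the graphs that model the given entity type, in table order."""
    return [name for name, cov in KG_COVERAGE.items() if entity_id in cov["entities"]]


# Which graphs model the citation relation (R1)?
print(graphs_with_relation("R1"))  # ['CKG', 'CovEx KG', 'Citation KG']
```

Such a lookup makes it easy to see, for instance, that concept-to-concept links (R3) are modeled only by the graphs restricted to concept entities (E4).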
Figure 2. Summary of the Development Process of Literature Search Engines.
Search Engine Comparison. “✓” signifies the existence of the corresponding characteristic, and “❙” signifies the lack of it. Marks between parentheses correspond to characteristics between parentheses.
| System | CO-Search [ | AWS CORD-19 Search (ACS) [ | COVID-19 Drug Repository [ | CovEx [ | COVIDex [ | Vapur [ | COVIDASK [ | [ | CAiRE-COVID [ | [ | CORD19-Explorer [ | SLEDGE-Z [ | S_COVID [ | [ | SLIC [ | SPIKE [ | EVIDENCEMINER [ | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Uses Raw Text (Uses KG) | ✓(❙) | ✓(✓) | ✓(❙) | ✓(✓) | ✓(❙) | ✓(✓) | ✓(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ✓(✓) | ✓(❙) | ✓(❙) | ✓(✓) | |
| Publicly Available | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ❙ | ✓ | ❙ | ✓ | ❙ | ✓ | ❙ | ❙ | ✓ | ✓ | |
| Feedback Loop | ❙ | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | |
| Multistage Ranking | ❙ | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | ❙ | |
| KG Traversal | ❙ | ✓ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ✓ | |
| Text Representations Levels (KG Representation Level) | Document (KG) | ✓(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ❙(❙) | ❙(❙) |
| Paragraph (Sub-graph) | ✓(❙) | ✓(✓) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(✓) | ❙(❙) | ✓(❙) | ❙(❙) | |
| Sentence (Edge) | ✓(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ✓(❙) | ✓(❙) | ✓(❙) | ❙(❙) | ✓(✓) | ✓(❙) | ❙(❙) | ✓(❙) | ✓(❙) | |
| Word (Node) | ❙(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ✓(❙) | |
| n-gram (Node Property) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | |
| Keyphrase (Edge Property) | ❙(❙) | ❙(❙) | ✓(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ❙(❙) | ✓(❙) | ✓(❙) | |
| Rep.Comb. | Inter-Level | ✓ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ✓ | ✓ | ✓ | ❙ | ✓ | ❙ |
| Text & KG | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | |
| Tasks | Document Retrieval (Indexing, Ranking) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ❙ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Passage Retrieval (Indexing, Ranking) | ✓ | ✓ | ❙ | ❙ | ✓ | ❙ | ✓ | ✓ | ✓ | ✓ | ❙ | ❙ | ✓ | ✓ | ❙ | ✓ | ✓ | |
| Question Answering | ✓ | ✓ | ❙ | ❙ | ✓ | ❙ | ✓ | ✓ | ✓ | ✓ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | |
| Summarization | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | |
| Topic Modeling | ❙ | ✓ | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | ❙ | |
| Recommendation | ❙ | ✓ | ❙ | ✓ | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | |
| FAQ Matching | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | |
| Search Type | Keyword | ❙ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ❙ | ✓ | ❙ | ✓ | ❙ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Open Questions | ✓ | ✓ | ❙ | ❙ | ❙ | ❙ | ✓ | ✓ | ✓ | ✓ | ❙ | ✓ | ✓ | ✓ | ✓ | ❙ | ✓ | |
| Keyphrases | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ✓ | ❙ | ✓ | ❙ | ✓ | ❙ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Regular Expression | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | |
| Novelty | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | |
| Data Enrichment | From External Resources | ❙ | ✓ | ✓ | ❙ | ✓ | ✓ | ✓ | ❙ | ❙ | ✓ | ✓ | ✓ | ❙ | ❙ | ✓ | ❙ | ✓ |
| From Internal Resources | ✓ | ✓ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | ❙ | |
Figure 3. Summary of Exploratory Search Application Creation Process.
Figure 4. TopicMaps Interface.
Exploratory Search Applications Summary. All links were last accessed on 4 April 2022.
| System | Vidar-19 [ | TopicMaps [ | Network Visualisations [ | SciSight [ | Semviz [ | EvidenceMiner [ | |
|---|---|---|---|---|---|---|---|
| Available Charts | Pie Chart | ❙ | ❙ | ❙ | ❙ | ❙ | ✓ |
| Histogram | ✓ | ✓ | ❙ | ✓ | ❙ | ❙ | |
| Data Tables | ❙ | ✓ | ❙ | ❙ | ✓ | ❙ | |
| Heat Map | ❙ | ❙ | ❙ | ❙ | ✓ | ❙ | |
| Tile Chart | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | |
| Word Cloud | ❙ | ✓ | ❙ | ✓ | ✓ | ❙ | |
| Stacked Barplot | ✓ | ❙ | ❙ | ❙ | ❙ | ❙ | |
| Bar Plot | ❙ | ❙ | ❙ | ❙ | ✓ | ✓ | |
| Bubble Maps | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | |
| Network/Graph | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | |
| Chord Diagram | ❙ | ❙ | ❙ | ✓ | ❙ | ❙ | |
| Indicators | Frequency | ✓ | ✓ | ✓ | ✓ | ❙ | ❙ |
| Count | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Other Indicators | ❙ | ✓ | ❙ | ❙ | ❙ | ❙ | |
| Related Tasks | IE | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Topic Modeling | ❙ | ✓ | ❙ | ✓ | ❙ | ❙ | |
| NER | ❙ | ❙ | ✓ | ✓ | ✓ | ❙ | |
| Network Analysis | ❙ | ❙ | ✓ | ✓ | ❙ | ❙ | |
| Data Source | Raw Text | ✓ | ✓ | ❙ | ✓ | ✓ | ✓ |
| KG | ❙ | ❙ | ✓ | ✓ | ✓ | ✓ | |
| Reactivity | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Public Availability | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Figure 5. Network Visualization Interface.