Jurica Ševa1, David Luis Wiegandt1, Julian Götze2, Mario Lamping3, Damian Rieke3,4,5, Reinhold Schäfer3,6, Patrick Jähnichen1, Madeleine Kittner1, Steffen Pallarz1, Johannes Starlinger1, Ulrich Keilholz3, Ulf Leser7.
Abstract
BACKGROUND: Diagnosis and treatment decisions in cancer increasingly depend on a detailed analysis of the mutational status of a patient's genome. This analysis relies on previously published information regarding the association of variations with disease progression and possible interventions. Clinicians to a large degree use biomedical search engines to obtain such information; however, the vast majority of scientific publications focus on basic science and have no direct clinical impact. We develop the Variant-Information Search Tool (VIST), a search engine designed for the targeted search of clinically relevant publications given an oncological mutation profile.
Keywords: Biomedical information retrieval; Clinical relevance; Document classification; Document retrieval; Document triage; Personalized oncology
Year: 2019 PMID: 31419935 PMCID: PMC6697931 DOI: 10.1186/s12859-019-2958-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 VIST System Architecture. Left: VIST backend with indexed and preprocessed documents. Right: VIST web interface for query processing and result presentation
VIST Index Summary
| Property | Count |
|---|---|
| Indexed documents | 29,711,223 |
| Classified as related to cancer | 630,512 |
| Classified as clinically relevant | 5,375,192 |
| Clinically relevant & cancer | 349,351 |
| Distinct variations | 433,882 |
| Documents with >0 variations | 323,722 |
| Total number of variations | 1,018,321 |
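A quick arithmetic pass puts the index counts above in proportion (all values copied from the table; the derived percentages are simple ratios, not figures from the record):

```python
# Sanity arithmetic over the VIST index summary (values from the table above).
indexed = 29_711_223   # indexed documents
cancer = 630_512       # classified as related to cancer
relevant = 5_375_192   # classified as clinically relevant
both = 349_351         # clinically relevant & cancer

# Share of the whole index that falls into each class.
print(f"cancer-related:      {cancer / indexed:.2%}")
print(f"clinically relevant: {relevant / indexed:.2%}")
print(f"both:                {both / indexed:.2%}")

# Of the cancer-related documents, how many are also clinically relevant?
print(f"relevant among cancer docs: {both / cancer:.1%}")
```

By this arithmetic, only about 2% of the index is cancer-related, and just over half of those documents are also classified as clinically relevant.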
Fig. 4 Precision (P), Recall (R), and F1 scores of the three evaluated classification tasks, i.e., classification by relatedness to cancer, by clinical relevance, and by cancer type. MTL: Multi-Task Learning; HATT: Hierarchical Attention Network; SVM: Support Vector Machine; RF: Random Forest
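Fig. 4 compares several text classifiers (SVM, RF, HATT, MTL) on these document-triage tasks. As an illustration only, not the authors' pipeline, a minimal nearest-centroid classifier over length-normalized bag-of-words vectors shows the basic shape of such a classifier; all example documents and labels below are invented:

```python
import math
from collections import Counter

def tf_vector(text):
    """Length-normalized term-frequency vector of a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def cosine(a, b):
    # a is already normalized; dot product over shared terms.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

class CentroidClassifier:
    """Assign a document to the label whose training centroid it is closest to."""
    def fit(self, docs, labels):
        sums, counts = {}, Counter(labels)
        for doc, lab in zip(docs, labels):
            cen = sums.setdefault(lab, {})
            for t, w in tf_vector(doc).items():
                cen[t] = cen.get(t, 0.0) + w
        self.centroids = {
            lab: {t: w / counts[lab] for t, w in cen.items()}
            for lab, cen in sums.items()
        }
        return self

    def predict(self, doc):
        vec = tf_vector(doc)
        return max(self.centroids, key=lambda lab: cosine(vec, self.centroids[lab]))

# Toy usage with invented training documents:
clf = CentroidClassifier().fit(
    ["braf v600e mutation predicts response to vemurafenib",
     "soil bacteria community diversity in forest ecosystems"],
    ["cancer", "other"])
print(clf.predict("v600e response in melanoma patients"))  # → cancer
```

A real triage classifier like those in Fig. 4 would use richer features (TF-IDF, embeddings) and a discriminative model, but the fit/predict structure is the same.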
Document counts of corpora used for document classification
| Corpus | Size | Cancer+ | Cancer- | Relevant+ | Relevant- |
|---|---|---|---|---|---|
| CIViC | 1,414 | 1,346 | 68 | 1,414 | 0 |
| PubMed | 20,017 | 0 | 20,017 | 0 | 20,017 |
Overview of corpora used for evaluation
| Corpus Property / Corpus | User Study | TREC PM 2017 | Tumorboard |
|---|---|---|---|
| Queries | 14 | 27 | 261 |
| Documents | 101 | 19,284 | 471 |
| Unique Documents | 96 | 16,359 | 325 |
| Documents/Query | 5.94 | 714.22 | 1.80 |
| Relevant Documents | 45 | 1,724 | 471 |
| Relevant Unique Documents | 44 | 1,681 | 325 |
| Relevant/Query | 3.21 | 63.85 | 1.80 |
| Irrelevant Documents | 56 | 17,560 | - |
| Irrelevant Unique Documents | 53 | 14,980 | - |
| Irrelevant/Query | 3.29 | 650.37 | - |
Properties are expressed as number of occurrences
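The per-query ratios in the table follow directly from the raw counts; a quick check of the Relevant/Query column:

```python
# Relevant documents per query, from the corpus table above.
corpora = {
    "User Study":   {"queries": 14,  "relevant": 45},
    "TREC PM 2017": {"queries": 27,  "relevant": 1724},
    "Tumorboard":   {"queries": 261, "relevant": 471},
}
for name, c in corpora.items():
    print(f"{name}: {c['relevant'] / c['queries']:.2f} relevant docs per query")
```

These reproduce the reported 3.21, 63.85, and 1.80 relevant documents per query.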
Fig. 2 VIST web interface. Top: Search bar for entering queries. Left: Filter options (by keywords, genes, journals, cancer type, and year of publication). Main pane: List of matching documents, ranked by score according to clinical relevance. Matching clinical trials are available in a second tab
Fig. 3 Detailed view of a matching document in VIST. Entities (genes, drugs, variations) as recognized by VIST's NER modules are highlighted. Sentences are colored according to the probability of carrying the main message of the abstract (key phrases)
Best performing ranking functions
| Ranking function | SVM Recall | SVM MAP | SVM MRR | SVM nDCG | MTL Recall | MTL MAP | MTL MRR | MTL nDCG |
|---|---|---|---|---|---|---|---|---|
| RankScore^ | | | | | | 0.088 | 0.119 | 0.260 |
| PubDate * RankScore | 0.634 | 0.168 | 0.306 | 0.560 | | 0.083 | 0.109 | 0.254 |
| CancerScore | 0.618 | 0.092 | 0.115 | 0.274 | 0.569 | | | |
| KeywordScore | 0.291 | 0.018 | 0.025 | 0.125 | 0.294 | 0.018 | 0.025 | 0.125 |
All elements of a ranking function are sorted in descending order. The KeywordScore, which completely neglects cancer relatedness and clinical relevance of documents, is included as a baseline. ^ used in the production version of VIST
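The exact scoring formulas are not spelled out in this record. As an assumption-laden sketch, the "PubDate * RankScore" row can be read as a recency-weighted relevance score, e.g. with exponential decay; the ten-year half-life below is an invented parameter, not a value from the paper:

```python
from datetime import date

def recency_factor(pub_date, today=date(2019, 6, 1), half_life_days=3650.0):
    """Exponential decay in publication age: a ten-year-old paper gets half
    the weight. The decay form and half-life are assumptions for illustration."""
    age = (today - pub_date).days
    return 0.5 ** (age / half_life_days)

def pubdate_times_rankscore(rank_score, pub_date):
    # Mirrors the 'PubDate * RankScore' idea: recency-weighted relevance.
    return recency_factor(pub_date) * rank_score

recent = pubdate_times_rankscore(0.8, date(2018, 6, 1))
old = pubdate_times_rankscore(0.8, date(2008, 6, 1))
print(recent > old)  # → True: same relevance, newer paper ranks higher
```

Any monotone recency transform would serve; the point is that documents with equal relevance scores are ordered by publication date.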
Evaluation results on several datasets and several metrics
| Dataset | System | MAP | MRR | nDCG | # Best Rel vs IrRel |
|---|---|---|---|---|---|
| TREC PM 2017 | KeywordScore | 0.0006 | 0.066 | 0.426 | 2 |
| | PubMed | | 0.056 | | 5 |
| | VIST MTL | 0.0003 | 0.051 | 0.238 | |
| | VIST SVM | | | 0.458 | |
| Tumorboard | KeywordScore | 0.0082 | 0.011 | 0.115 | - |
| | PubMed | 0.0489 | 0.070 | | - |
| | VIST MTL | 0.0242 | 0.035 | 0.103 | - |
| | VIST SVM | | | 0.220 | - |
| UserStudy | KeywordScore | 0.0631 | 0.296 | 0.645 | 2 |
| | PubMed | 0.0847 | 0.236 | 0.580 | 3 |
| | VIST MTL | 0.0571 | 0.239 | 0.407 | |
| | VIST SVM | | | | |
Low values are due to a small number of known PMIDs for individual queries. “# best Rel vs IrRel”: Number of queries for which the corresponding system has the best “Rel vs IrRel” score (27 queries for TREC PM 2017, 14 queries for UserStudy). *VIST SVM and VIST MTL are compared separately with KeywordScore and PubMed. KeywordScore is the ranking provided in the default settings of Solr
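For reference, the MAP and MRR columns rest on standard per-query definitions, averaged over all queries of a dataset. A minimal implementation:

```python
def reciprocal_rank(ranked_ids, relevant):
    """1/rank of the first relevant document; 0 if none is retrieved."""
    for i, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked_ids, relevant):
    """Precision at each rank holding a relevant document, summed and
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_over_queries(metric, runs):
    """runs: list of (ranked_ids, relevant_set) pairs, one per query.
    Mean of average_precision is MAP; mean of reciprocal_rank is MRR."""
    return sum(metric(r, rel) for r, rel in runs) / len(runs)

# Toy usage: relevant docs "b" and "d" at ranks 2 and 4.
print(average_precision(["a", "b", "c", "d"], {"b", "d"}))  # → 0.5
print(reciprocal_rank(["a", "b", "c", "d"], {"b", "d"}))    # → 0.5
```

As the footnote notes, when only a handful of relevant PMIDs are known per query, both metrics are necessarily small even for good rankings.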
Fig. 5 Evaluation results based on the UserStudy data set: Precision at k (P@k) and recall at k (R@k) of three different ranking schemes, i.e., PubMed, KeywordScore, and VIST SVM. Here, k refers to the k-th document in a ranked list that is also contained in the reference list
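The caption uses a variant of P@k and R@k in which the ranked list is first restricted to documents that appear in the reference (judged) list; the helper below implements that restriction as an assumption about the caption's wording, with the standard cutoff metrics on top:

```python
def filter_to_judged(ranked_ids, judged):
    # Per the caption, k counts only documents that also appear in the
    # reference list, so unjudged documents are skipped before cutting at k.
    return [d for d in ranked_ids if d in judged]

def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / len(relevant)

# Toy usage: "x" and "y" are unjudged and dropped before the cutoff.
ranked = filter_to_judged(["x", "a", "b", "y", "c"], {"a", "b", "c"})
print(ranked)                                   # → ['a', 'b', 'c']
print(precision_at_k(ranked, {"a", "c"}, 2))    # → 0.5
print(recall_at_k(ranked, {"a", "c"}, 2))       # → 0.5
```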