Jake Lever, Martin R Jones, Arpad M Danos, Kilannin Krysiak, Melika Bonakdar, Jasleen K Grewal, Luka Culibrk, Obi L Griffith, Malachi Griffith, Steven J M Jones.
Abstract
BACKGROUND: Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer. To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation from skilled experts who read and interpret the relevant biomedical literature.
Keywords: Cancer biomarkers; Information extraction; Machine learning; Precision oncology; Text mining
Year: 2019 PMID: 31796060 PMCID: PMC6891984 DOI: 10.1186/s13073-019-0686-y
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
The five groups of search terms used to identify sentences that potentially discussed the four evidence types. Strings such as “sensitiv” are used to capture multiple words including “sensitive” and “sensitivity”
| General | Diagnostic | Predictive | Predisposing | Prognostic |
|---|---|---|---|---|
| marker | diagnostic | sensitiv | risk | survival |
| | | resistance | predispos | prognos |
| | | efficacy | | DFS |
| | | predict | | |
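The stem-based filtering described in the caption above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `matching_groups`, the case-insensitive substring matching, and the assignment of each stem to a column (our reading of the table) are all assumptions.

```python
# Search-term stems taken from the table above. The column assignment is
# our reading of the table, not something confirmed by the source.
SEARCH_TERMS = {
    "General": ["marker"],
    "Diagnostic": ["diagnostic"],
    "Predictive": ["sensitiv", "resistance", "efficacy", "predict"],
    "Predisposing": ["risk", "predispos"],
    "Prognostic": ["survival", "prognos", "DFS"],
}


def matching_groups(sentence):
    """Return the term groups with at least one stem found in the sentence.

    Matching is a case-insensitive substring test (an assumption), so the
    stem "sensitiv" matches both "sensitive" and "sensitivity".
    """
    lowered = sentence.lower()
    return [
        group
        for group, stems in SEARCH_TERMS.items()
        if any(stem.lower() in lowered for stem in stems)
    ]


if __name__ == "__main__":
    example = "Overexpression of Her2 is associated with poor prognosis."
    print(matching_groups(example))
```

A sentence would be kept as a candidate when the returned list is non-empty. Plain substring matching is deliberately loose, which suits a recall-oriented pre-filter ahead of manual annotation.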
Fig. 1 a A screenshot of the annotation platform that allowed expert annotators to select the relation types for different candidate relations in all of the sentences. The example sentence shown describes a prognostic marker. b An overview of the annotation process. Sentences that describe cancers, genes, variants, and optionally drugs are identified from the literature and then filtered using search terms. The first test phase attempted a combined annotation of biomarkers and variants but was unsuccessful. The annotation task was then split into two separate tasks, one for biomarkers and one for variants. Each task had a test phase followed by the main phase on the 800 sentences that were used to create the gold set
Fig. 2 a The precision-recall curves illustrate the performance of the five relation extraction models built for the four evidence types and the associated variant prediction. b The same data can be visualized in terms of the threshold values on the logistic regression to select the appropriate value for high precision with reasonable recall
The inter-annotator agreement for the main phase for 800 sentences, measured with F1-score, showed good agreement in the two sets of annotations for biomarkers as well as very high agreement in the variant annotation task. The sentences from the multiple test phases are not included in these numbers and were discarded from further analysis
| | Annotator 2 | Annotator 3 |
|---|---|---|
| Annotator 1 | 0.74 | 0.73 |
| Annotator 2 | NA | 0.74 |
| Annotator 1 | 0.78 | 0.85 |
| Annotator 2 | NA | 0.79 |
| Annotator 1 | 0.96 | 0.96 |
| Annotator 2 | NA | 0.96 |
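As a rough illustration of how an F1-based agreement score between two annotators can be computed, the sketch below treats each annotator's output as a set of labelled items and scores their overlap. The representation of an annotation as a `(sentence_id, relation_type)` tuple, the function name `pairwise_f1`, and the convention of returning 1.0 when both sets are empty are assumptions, not details taken from the paper.

```python
def pairwise_f1(annotations_a, annotations_b):
    """F1-score between two annotators' sets of annotations.

    Treating one annotator as the reference and the other as the prediction,
    F1 = 2*TP / (2*TP + FP + FN), which equals 2*|A & B| / (|A| + |B|).
    The score is symmetric, so it does not matter which set is the reference.
    """
    a, b = set(annotations_a), set(annotations_b)
    total = len(a) + len(b)
    if total == 0:
        # Convention (assumption): two empty annotation sets agree perfectly.
        return 1.0
    return 2 * len(a & b) / total


if __name__ == "__main__":
    # Hypothetical annotations: (sentence_id, relation_type) pairs.
    annotator_1 = {(1, "Prognostic"), (2, "Diagnostic"), (3, "Predictive")}
    annotator_2 = {(1, "Prognostic"), (2, "Diagnostic"), (4, "Predisposing")}
    print(round(pairwise_f1(annotator_1, annotator_2), 3))
```

Because the score is symmetric, each pair of annotators needs only one cell, which is consistent with the NA entries in the table above.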
Number of annotations in the training and test sets
| Annotation | Train | Test |
|---|---|---|
| Associated variant | 768 | 270 |
| Diagnostic | 156 | 62 |
| Predictive | 147 | 43 |
| Predisposing | 125 | 57 |
| Prognostic | 232 | 88 |
The selected thresholds for each relation type with the high precision and lower recall trade-off
| Extracted relation | Threshold | Precision | Recall |
|---|---|---|---|
| Associated variant | 0.70 | 0.941 | 0.794 |
| Diagnostic | 0.63 | 0.957 | 0.400 |
| Predictive | 0.93 | 0.891 | 0.141 |
| Predisposing | 0.86 | 0.837 | 0.218 |
| Prognostic | 0.65 | 0.878 | 0.414 |
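The thresholds in the table above could be applied to classifier scores along the following lines. This is a sketch under stated assumptions: the function name `keep_prediction`, the use of a score in the range 0 to 1, and the inclusive `>=` comparison are illustrative choices rather than the authors' code.

```python
# Per-relation thresholds copied from the table above.
THRESHOLDS = {
    "Associated variant": 0.70,
    "Diagnostic": 0.63,
    "Predictive": 0.93,
    "Predisposing": 0.86,
    "Prognostic": 0.65,
}


def keep_prediction(relation_type, score):
    """Return True if a classifier score meets the threshold for its type.

    The inclusive comparison (>=) is an assumption; the source does not say
    whether scores exactly at the threshold are kept.
    """
    return score >= THRESHOLDS[relation_type]


if __name__ == "__main__":
    # Hypothetical scores for two candidate relations.
    print(keep_prediction("Diagnostic", 0.90))  # 0.90 is above 0.63
    print(keep_prediction("Predictive", 0.90))  # 0.90 is below 0.93
```

Using a separate threshold per relation type means the same score can be accepted for one evidence type and rejected for another, which reflects the high-precision, lower-recall trade-off the table reports.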
Fig. 3 a A Shiny-based web interface allows for easy exploration of the CIViCmine biomarkers with filters and overview pie charts. The main table shows the list of biomarkers and links to a subsequent table showing the list of supporting sentences. b The entirety of PubMed and the PubMed Central Open Access subset was processed to extract 87,412 biomarkers distributed between the four different evidence types shown. c Protein-coding variants extracted for each evidence item are compared against somatic variants in COSMIC and SNPs with > 1% prevalence in dbSNP
Four example sentences for the four evidence types extracted by CIViCmine. The associated PubMed IDs are also shown for reference
| Type | PMID | Sentence |
|---|---|---|
| Diagnostic | 29214759 | JAK2 V617F is the most common mutation in myeloproliferative neoplasms (MPNs) and is a major diagnostic criterion. |
| Predictive | 28456787 | In non-small cell lung cancer (NSCLC) driver mutations of EGFR are positive predictive biomarkers for efficacy of erlotinib and gefitinib. |
| Predisposing | 28222693 | Our study suggests that one BRCA1 variant may be associated with increased risk of breast cancer. |
| Prognostic | 28469333 | Overexpression of Her2 in breast cancer is a key feature of pathobiology of the disease and is associated with poor prognosis. |
Fig. 4 The top 20 a genes, b cancer types, c drugs, and d variants extracted as part of evidence items
Fig. 5 a A comparison of the associations and papers in CIViCmine with CIViC, the Cancer Genome Interpreter, and OncoKB. b The top results in CIViCmine were evaluated by a CIViC curator and measured for three categories (correctness, usability, and need). Percentages are shown for each metric and evidence type for no, intermediate, and yes