Roberto Zanoli, Alberto Lavelli, Theresa Löffler, Nicolas Andres Perez Gonzalez, Fabio Rinaldi.
Abstract
BACKGROUND: Melanoma is one of the least common but deadliest skin cancers. It begins when the genes of a cell are damaged or fail, so identifying the genes involved in melanoma is crucial for understanding melanoma tumorigenesis. Thousands of publications about human melanoma appear every year. However, manual biological curation of data is costly and time-consuming, and to date the application of machine learning to gene-melanoma relation extraction from text has been severely limited by the lack of annotated resources.
Keywords: Annotated Dataset; Deep Learning; Machine Learning; Melanoma; Relation Extraction
Year: 2022 PMID: 35045882 PMCID: PMC8772125 DOI: 10.1186/s13326-021-00251-3
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 Number of publications on melanoma per year in PubMed
Fig. 2 The MGDB Basic Information page reports a description of the annotated genes (in this case APAF-1)
Fig. 3 The snippets associated with a gene (in this case APAF-1) contain the text evidence supporting the relation between the gene and melanoma
MGDB contains 1,272 relations at the concept level and ≥1,403 at the mention level
| Genes | PMID | Snippets | Relations (concept level) | Relations (mention level) |
|---|---|---|---|---|
| 527 | 910 | 1,403 | 1,272 | ≥1,403 |
Fig. 4 Concept level: there is one relation between gene 〈ID: 30014〉 and melanoma 〈ID: D008545〉. Mention level: there are three relations between three mentions (SPANX, SPANX, sperm protein associated with the nucleus) of gene 〈ID: 30014〉 and two mentions (melanoma, melanoma) of disease 〈ID: D008545〉
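The distinction in Fig. 4 can be illustrated with a minimal sketch: mention-level relations are kept one per pair of text mentions, while concept-level relations are the distinct (gene ID, disease ID) pairs behind them. The `MentionRelation` class, the `snippet` index, and the way the three gene mentions are paired with the two disease mentions are assumptions made for illustration; the caption does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MentionRelation:
    snippet: int          # hypothetical snippet index, for illustration only
    gene_id: str
    gene_mention: str
    disease_id: str
    disease_mention: str


def concept_level(relations):
    """Collapse mention-level relations to distinct (gene ID, disease ID) pairs."""
    return {(r.gene_id, r.disease_id) for r in relations}


# Three mention-level relations for gene 30014 and disease D008545.
# The pairing of mentions below is assumed, not taken from the source.
mention_relations = [
    MentionRelation(1, "30014", "SPANX", "D008545", "melanoma"),
    MentionRelation(2, "30014", "SPANX", "D008545", "melanoma"),
    MentionRelation(2, "30014", "sperm protein associated with the nucleus",
                    "D008545", "melanoma"),
]

concept_relations = concept_level(mention_relations)
print(len(mention_relations), "mention-level relations")
print(len(concept_relations), "concept-level relation(s)")
```

Under these assumptions the three mention-level records collapse to a single concept-level relation, which matches the counts given in the caption.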
Number of relations in MGDB and of relations that have been maintained in the MGR base dataset
| MGDB (concept level) | MGDB (mention level) | MGR base dataset (concept level) | MGR base dataset (mention level) |
|---|---|---|---|
| 1,272 | 1,403 | 1,244 (97.80%) | 1,192 (84.96%) |
Fig. 5 PubTator annotation for article 〈PMID: 15986140〉. Concept level: gene 〈ID: 6774〉 is related to melanoma 〈ID: D008545〉. Mention level: gene 〈ID: 6774〉 at position START:10, END:15 is related to melanoma 〈ID: D008545〉 at position START:24, END:32
Fig. 6 Extracted relation between gene 〈ID: 5728〉 and melanoma 〈ID: D008545〉 for article 〈PMID: 10446968〉
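As a rough illustration of the annotation in Fig. 5, the sketch below parses the commonly used PubTator text layout (`PMID|t|title` lines plus tab-separated `PMID, start, end, mention, type, ID` lines). That the dataset files follow exactly this layout is an assumption, and the title text and mention strings in the example are invented placeholders chosen only so that the offsets 10-15 and 24-32 from the caption line up. The `Mention` type and `parse_pubtator` function are hypothetical names.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    pmid: str
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str
    entity_type: str
    concept_id: str


def parse_pubtator(block: str):
    """Parse one PubTator-style record into passages and mentions."""
    passages = {}
    mentions = []
    for line in block.strip().splitlines():
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[1] in ("t", "a"):
            passages[parts[1]] = parts[2]   # title ("t") or abstract ("a")
            continue
        fields = line.split("\t")
        if len(fields) == 6:
            pmid, start, end, text, etype, cid = fields
            mentions.append(Mention(pmid, int(start), int(end), text, etype, cid))
    return passages, mentions


# Placeholder record: only the PMID, IDs, and offsets come from the caption.
example = (
    "15986140|t|Activated STAT3 in skin melanoma cells\n"
    "15986140\t10\t15\tSTAT3\tGene\t6774\n"
    "15986140\t24\t32\tmelanoma\tDisease\tD008545\n"
)

passages, mentions = parse_pubtator(example)
for m in mentions:
    # Each offset pair should slice the mention text out of the passage.
    assert passages["t"][m.start:m.end] == m.text
    print(m.entity_type, m.concept_id, m.start, m.end, m.text)
```

Keeping the offsets alongside the concept IDs is what allows the same record to serve both views: the IDs alone give the concept-level relation, and the IDs plus positions give the mention-level one.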
Number of mentions (concepts) and relations for the MGR base training and test set splits
| Split (PMIDs) | Gene mentions (IDs) | Disease mentions (IDs) | Relations (concept level) | Relations (mention level) |
|---|---|---|---|---|
| train (605) | 8,622 (868) | 4,251 (1) | 828 | 1,076 |
| test (302) | 4,262 (537) | 2,073 (1) | 416 | 537 |
| all (907) | 12,884 (1,127) | 6,324 (1) | 1,244 | 1,613 |
Precision (Pr), Recall (Re), and F1 measure of the models (BioBERT, CNN, decision tree) and baselines (sentence- and abstract-level) calculated on the concept-level test set
| Models | Pr | Re | F1 |
|---|---|---|---|
| BioBERT | 74.42 | 77.29 (67.09) | 75.83 |
| CNN | 70.50 | 71.01 (59.15) | 70.76 |
| decision tree | 66.27 | 67.87 (44.18) | 67.06 |
| sentence-level | 49.68 | 94.93 | 65.23 |
| abstract-level | 35.75 | 100.00 | 52.67 |
Values in brackets are the recall calculated on the mention-level test set
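The F1 column can be checked against the standard definition, F1 = 2 · Pr · Re / (Pr + Re). The sketch below recomputes it from the concept-level precision and recall reported in the table; the function name `f1_score` and the `reported` dictionary are illustrative names, not part of the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Concept-level precision and recall as reported in the table above.
reported = {
    "BioBERT": (74.42, 77.29),
    "CNN": (70.50, 71.01),
    "decision tree": (66.27, 67.87),
    "sentence-level": (49.68, 94.93),
    "abstract-level": (35.75, 100.00),
}

for name, (pr, re) in reported.items():
    print(f"{name}: F1 = {f1_score(pr, re):.2f}")
```

Because the table's precision and recall are themselves rounded to two decimals, a recomputed F1 can differ from the reported value in the last digit.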
Number of mentions (concepts) and relations in the MGR extended dataset
| PMID | Gene mentions (IDs) | Disease mentions (IDs) | Relations | Time [min] |
|---|---|---|---|---|
| 89,137 | 418,613 (6,839) | 276,539 (1) | 16,215 | 104 |
Time measured on an NVIDIA Tesla K80 GPU (24 GB)