Stefano Marchesin, Gianmaria Silvello.
Abstract
BACKGROUND: Databases are fundamental to advancing biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Moreover, these resources are all limited in size, preventing models from scaling effectively to large amounts of data.
Keywords: Biomedical Relation Extraction; Gene-Disease Association; Weak supervision
Year: 2022 PMID: 35361129 PMCID: PMC8973894 DOI: 10.1186/s12859-022-04646-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Per-relation statistics for TBGA
| Granularity | Split | Therapeutic | Biomarker | Genomic alterations | NA |
|---|---|---|---|---|---|
| Sentence-level | Train | 3139 | 20,145 | 32,831 | 122,149 |
| Sentence-level | Validation | 402 | 2279 | 2306 | 15,206 |
| Sentence-level | Test | 384 | 2315 | 2209 | 15,608 |
| Bag-level | Train | 2218 | 13,372 | 12,759 | 56,698 |
| Bag-level | Validation | 331 | 2019 | 1147 | 6994 |
| Bag-level | Test | 308 | 2068 | 1122 | 6996 |
Statistics are reported separately for each data split. Columns represent, from left to right, the considered granularity level, the data split, and the number of instances and bags associated with Therapeutic, Biomarker, Genomic Alterations, and NA relations
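As a quick sanity check on the table above, the per-relation counts can be summed per split. The sketch below only re-uses the numbers printed in the table; the variable and function names are illustrative and not part of the TBGA release.

```python
# Per-relation counts copied from the table above.
# Tuple order: Therapeutic, Biomarker, Genomic alterations, NA.
# Sentence-level rows count instances; bag-level rows count bags.
counts = {
    ("Sentence-level", "Train"): (3139, 20145, 32831, 122149),
    ("Sentence-level", "Validation"): (402, 2279, 2306, 15206),
    ("Sentence-level", "Test"): (384, 2315, 2209, 15608),
    ("Bag-level", "Train"): (2218, 13372, 12759, 56698),
    ("Bag-level", "Validation"): (331, 2019, 1147, 6994),
    ("Bag-level", "Test"): (308, 2068, 1122, 6996),
}


def split_totals(table):
    """Sum the four relation counts for every (granularity, split) row."""
    return {key: sum(values) for key, values in table.items()}


totals = split_totals(counts)
for (granularity, split), total in totals.items():
    print(f"{granularity:<15} {split:<11} {total:,}")

# Total number of sentence-level instances across the three splits.
all_instances = sum(
    total for (granularity, _), total in totals.items()
    if granularity == "Sentence-level"
)
print(f"All instances: {all_instances:,}")
```

The sentence-level totals add up to 218,973 instances, which agrees with the TBGA instance count reported in the global statistics.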
Fig. 1 The 20 most frequent genes, diseases, and GDAs within TBGA
Global statistics comparison between TBGA, EU-ADR [13], CoMAGC [15], PolySearch [14], GAD [27], and GDAE [28] datasets
| Dataset | Annotation | Instances | Publications | Inst.s/pub. | Genes | Diseases | Relations |
|---|---|---|---|---|---|---|---|
| CoMAGC | Manual | 821 | 408 | 2.01 | 538 | 3 | 15 |
| EU-ADR | Manual | 355 | 65 | 5.46 | 221 | 118 | 4 |
| PolySearch | Manual | 522 | 374 | 1.40 | 245 | 10 | 2 |
| GAD | Weak | 5329 | 4112 | 1.30 | 1139 | 535 | 3 |
| GDAE | Weak | 8000 | 5875 | 1.36 | 3635 | 1904 | 2 |
| TBGA | Weak | 218,973 | 134,059 | 1.63 | 11,784 | 9199 | 4 |
Columns represent, from left to right, the considered dataset, the type of annotation, the total number of instances and publications, the average number of instances per publication, as well as the total number of genes, diseases, and relations
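The "Inst.s/pub." column is the ratio of instances to publications, rounded to two decimals. A minimal sketch that recomputes it from the table values follows; the dictionary layout and the function name are assumptions made for illustration.

```python
# (instances, publications) pairs copied from the table above.
datasets = {
    "CoMAGC": (821, 408),
    "EU-ADR": (355, 65),
    "PolySearch": (522, 374),
    "GAD": (5329, 4112),
    "GDAE": (8000, 5875),
    "TBGA": (218973, 134059),
}


def instances_per_publication(instances, publications):
    """Average number of instances per publication, rounded to 2 decimals."""
    return round(instances / publications, 2)


ratios = {
    name: instances_per_publication(instances, publications)
    for name, (instances, publications) in datasets.items()
}

for name, ratio in ratios.items():
    print(f"{name:<10} {ratio:.2f}")
```

The recomputed ratios match the reported column for all six datasets.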
Global statistics comparison between TBGA, BioRel [24], and DTI [10] datasets
| Dataset | Split | Instances | Bags | Inst.s/bag | Relations |
|---|---|---|---|---|---|
| BioRel | Train | 534,277 | 39,969 | 13.37 | 125 |
| BioRel | Validation | 114,506 | 20,675 | 5.54 | 125 |
| BioRel | Test | 114,565 | 20,756 | 5.52 | 125 |
| DTI | Train | 604,303 | 472,033 | 1.28 | 6 |
| DTI | Validation | 6133 | 4769 | 1.29 | 6 |
| DTI | Test | 6312 | 4817 | 1.31 | 6 |
| TBGA | Train | 178,264 | 85,047 | 2.10 | 4 |
| TBGA | Validation | 20,193 | 10,491 | 1.92 | 4 |
| TBGA | Test | 20,516 | 10,494 | 1.96 | 4 |
Statistics are reported separately for each data split. Columns represent, from left to right, the considered dataset, the data split, the total number of instances and bags, the average number of instances per bag, as well as the total number of relations
RE models performance on TBGA dataset
| Model | Strategy | AUPRC | P@50 | P@100 | P@250 | P@500 | P@1000 |
|---|---|---|---|---|---|---|---|
| CNN | AVE | 0.422 | 0.760 | 0.744 | 0.696 | 0.625 | |
| CNN | ATT | 0.403 | 0.760 | 0.788 | 0.710 | 0.624 | |
| PCNN | AVE | 0.426 | 0.744 | 0.720 | 0.664 | | |
| PCNN | ATT | 0.404 | 0.760 | 0.750 | 0.744 | 0.700 | 0.628 |
| BiGRU | AVE | 0.437 | 0.620 | 0.720 | 0.724 | 0.730 | 0.678 |
| BiGRU | ATT | 0.423 | 0.760 | 0.750 | 0.748 | 0.726 | 0.666 |
| BiGRU-ATT | AVE | 0.419 | 0.740 | 0.740 | 0.748 | 0.694 | 0.615 |
| BiGRU-ATT | ATT | 0.390 | 0.680 | 0.760 | 0.756 | 0.702 | 0.631 |
| BERE | AVE | 0.419 | 0.700 | 0.710 | 0.720 | 0.704 | 0.620 |
| BERE | ATT | | | | | | |
Columns represent, from left to right, the considered RE model, the aggregation strategy, the AUPRC score, as well as the P@50, P@100, P@250, P@500, and P@1000 scores. For each measure, bold values represent the best scores
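For readers unfamiliar with the P@k measure, the following is a minimal sketch of how precision at k is commonly computed: predictions are ranked by confidence and the fraction of correct ones among the top k is reported. The function name, the input layout, and the toy data are assumptions for illustration; this is not the evaluation code used to produce the table.

```python
def precision_at_k(scored_predictions, k):
    """Precision among the k highest-scoring predictions.

    scored_predictions: iterable of (confidence, is_correct) pairs, one per
    predicted relation fact (hypothetical layout).
    The denominator is always k, so fewer than k predictions lowers the score.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    ranked = sorted(scored_predictions, key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, is_correct in ranked[:k] if is_correct)
    return hits / k


# Toy ranking (made-up values, not taken from TBGA).
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, True), (0.2, False)]
p_at_2 = precision_at_k(toy, 2)
p_at_4 = precision_at_k(toy, 4)
print(p_at_2, p_at_4)
```

In the toy ranking, one of the top two and three of the top four predictions are correct, giving 0.5 and 0.75 respectively.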
Fig. 2 Precision-Recall curves for RE models on the TBGA dataset. RE models are evaluated using both aggregation strategies, that is, average-based (AVE) and attention-based (ATT). Therefore, precision-recall curves are plotted for each aggregation strategy
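The two aggregation strategies can be illustrated with a small sketch. It assumes that each sentence in a bag has already been encoded as a vector and that attention weights come from a softmax over dot products with a relation query vector. This is one common formulation, not necessarily the exact one used by the evaluated models, and all names below are hypothetical.

```python
import math


def average_aggregate(sentence_vectors):
    """AVE: the bag representation is the element-wise mean of its sentences."""
    n = len(sentence_vectors)
    return [sum(column) / n for column in zip(*sentence_vectors)]


def attention_aggregate(sentence_vectors, query):
    """ATT: weight each sentence by softmax(dot(sentence, query)).

    `query` stands in for a learned relation vector (an assumption here).
    Returns the weighted bag representation and the attention weights.
    """
    scores = [sum(x * q for x, q in zip(vec, query)) for vec in sentence_vectors]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    normaliser = sum(exps)
    weights = [e / normaliser for e in exps]
    bag = [
        sum(w * vec[i] for w, vec in zip(weights, sentence_vectors))
        for i in range(len(query))
    ]
    return bag, weights


# Toy bag with two 2-dimensional sentence encodings (made-up values).
bag_sentences = [[1.0, 0.0], [0.0, 1.0]]
relation_query = [1.0, 0.0]

bag_ave = average_aggregate(bag_sentences)
bag_att, att_weights = attention_aggregate(bag_sentences, relation_query)
print(bag_ave)
print(bag_att, att_weights)
```

With averaging, every sentence contributes equally; with attention, the sentence most aligned with the query dominates the bag representation, which is the usual motivation for attention under noisy weak supervision.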
Fig. 3 Overview of the TBGA creation process. The process consists of four steps: (1) data acquisition; (2) data cleaning; (3) distant supervision; and (4) dataset generation
Fig. 4 DisGeNET association type ontology. For each association type, we also report its SIO identifier
Global and per-relation statistics for data cleaning and dataset generation
| Granularity | Target | Raw | TS | DR | RN | DB |
|---|---|---|---|---|---|---|
| Global | Publications | 707,390 | 572,981 | 572,607 | 447,280 | 57,675 |
| Global | Genes | 21,118 | 17,658 | 17,658 | 17,658 | 8827 |
| Global | Diseases | 23,433 | 17,032 | 17,023 | 17,023 | 6964 |
| Therapeutic | Instances | 10,744 | 4132 | 3925 | 3925 | 3925 |
| Therapeutic | Bags | 6872 | 2939 | 2857 | 2857 | 2857 |
| Biomarker | Instances | 1,530,072 | 1,080,089 | 1,075,327 | 580,053 | 24,739 |
| Biomarker | Bags | 605,826 | 460,334 | 460,276 | 383,358 | 17,459 |
| Genomic Alterations | Instances | 849,472 | 531,601 | 516,630 | 516,630 | 37,346 |
| Genomic Alterations | Bags | 289,693 | 202,548 | 202,045 | 202,045 | 15,028 |
Columns represent, from left to right, the considered granularity level, the target item, the raw (initial) statistics, and the statistics after each Data Cleaning and Dataset Generation step. The steps are: TS, DR, RN, and DB