TBGA: a large-scale Gene-Disease Association dataset for Biomedical Relation Extraction.

Stefano Marchesin¹, Gianmaria Silvello².

Abstract

BACKGROUND: Databases are fundamental to advance biomedical science. However, most of them are populated and updated with a great deal of human effort. Biomedical Relation Extraction (BioRE) aims to shift this burden to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of BioRE's most relevant tasks. Nevertheless, few resources have been developed to train models for GDA extraction. Besides, these resources are all limited in size, preventing models from scaling effectively to large amounts of data.
RESULTS: To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction. DisGeNET stores one of the largest available collections of genes and variants involved in human diseases. Relying on DisGeNET, we developed TBGA: a GDA extraction dataset generated from more than 700K publications that consists of over 200K instances and 100K gene-disease pairs. Each instance consists of the sentence from which the GDA was extracted, the corresponding GDA, and the information about the gene-disease pair.
CONCLUSIONS: TBGA is amongst the largest datasets for GDA extraction. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging and well-suited dataset for the task. We made the dataset publicly available to foster the development of state-of-the-art BioRE models for GDA extraction.

Keywords: Biomedical Relation Extraction; Gene-Disease Association; Weak supervision

Year: 2022 PMID: 35361129 PMCID: PMC8973894 DOI: 10.1186/s12859-022-04646-6

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Curated databases, such as UniProt [1], DrugBank [2], CTD [3], IUPHAR/BPS [4], Reactome [5], OMIM [6], or COSMIC [7], are pivotal to the development of biomedical science. Such databases are usually populated and updated with expensive and time-consuming human effort [8], which slows down the biological knowledge discovery process. To overcome this limitation, Biomedical Information Extraction (BioIE) aims to shift population and curation processes to machines by developing effective computational tools that automatically extract meaningful facts from the vast unstructured scientific literature [9, 10]. Once extracted, machine-readable facts can be fed to downstream tasks to ease biological knowledge discovery. Among the various tasks, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges to advance precision medicine and drug discovery [11], as it helps to understand the genetic causes of diseases [12]. Thus, the automatic extraction and curation of GDAs is key to advancing precision medicine research and providing knowledge to assist disease diagnostics, drug discovery, and therapeutic decision-making.
Most datasets for GDA extraction are hand-labeled corpora [13-15]. Among them, EU-ADR [13] contains only a small number of GDA instances, making it difficult to train robust Relation Extraction (RE) models for GDA extraction. On the other hand, PolySearch [14] focuses on only ten specific diseases, which are not sufficient to develop comprehensive models. Similarly, CoMAGC [15] only comprises gene-cancer associations on prostate, breast, and ovarian cancers. Hence, all of these datasets lack enough GDA heterogeneity to train effective RE models. Furthermore, hand-labeling data is an expensive process that requires large amounts of time from expert biologists; therefore, all of these datasets are limited in size. To address this limitation, distant supervision has been proposed [16].
Under the distant supervision paradigm, all the sentences mentioning the same pair of entities are labeled with the corresponding relation stored within a source database. The assumption is that if two entities participate in a relation, at least one sentence mentioning them conveys that relation. As a consequence, distant supervision generates a large number of false positives, since not all sentences express the relation between the considered entities. To counter false positives, the RE task under distant supervision can be modeled as a Multi-Instance Learning (MIL) problem [17-20]. With MIL, the sentences containing two entities connected by a given relation are collected into bags labeled with that relation. Grouping sentences into bags reduces noise, as a bag of sentences is more likely to express a relation than a single sentence. Thus, distant supervision alleviates manual annotation efforts, and MIL increases the robustness of RE models to noise.
Since the advent of distant supervision, several datasets for RE have been developed under this paradigm for news and web domains [16, 18, 21, 22], and recently also for biomedical science [10, 23, 24]. The most relevant biomedical datasets are BioRel [24], a large-scale dataset for domain-general Biomedical Relation Extraction (BioRE), and DTI [10], a large-scale dataset developed to extract Drug–Target Interactions (DTIs).
However, despite the success of distant supervision for RE tasks, its evaluation is known to be flawed [25, 26]. In this regard, previous works either employ inconsistent and expensive approaches to manually evaluate a small sample of model predictions or test models directly on distant-labeled data, which are inherently noisy and can skew the measured performance of a model. Only recently has some progress been made towards enhancing distantly-supervised datasets with human annotations [25-28].
Regarding GDA datasets, Bravo et al. [27] developed a semi-automatically annotated corpus based on the Genetic Association Database (GAD) [29], a retired archive of human genetic association studies of complex diseases. GAD provides the sentence in which a GDA is stated, but omits the information on the exact location of the gene and the disease within that sentence. Thus, the authors were required to perform Named Entity Recognition (NER), which inevitably introduces noise into the annotation pipeline, to identify genes and diseases within GAD sentences. Once identified, the authors kept those sentences where the gene and disease reflect a GDA annotated by GAD curators as positive or negative. Then, to store false GDAs (that is, GDAs where the gene and the disease co-occur within a sentence but are not semantically associated), Bravo et al. selected sentences with co-occurring genes and diseases that were not annotated by GAD curators as GDAs. Similarly, Nourani and Reshadat [28] exploited DisGeNET [12] to develop a semi-automatically annotated dataset for GDA extraction. DisGeNET is one of the largest available collections of genes and variants involved in human diseases, integrating data from expert-curated repositories, Genome-Wide Association Studies (GWAS) catalogs [30], animal models, and scientific literature. For each GDA, DisGeNET provides the publication(s) supporting the association, a representative sentence from each publication, the original source, as well as information on the gene and disease involved in the association. Hence, the authors kept the GDAs, and the corresponding sentences, coming from DisGeNET curated resources as true instances, whereas they obtained false GDAs through distant supervision by selecting sentences where co-occurring genes and diseases do not participate in any GDA within DisGeNET.
However, despite the use of large source databases and distant supervision, both of the resulting datasets are limited in size and have not been designed for a MIL setting, which is the de facto standard for distantly-supervised datasets. To overcome the limited size of current manually or semi-automatically annotated GDA datasets, as well as the noisy nature of fully distantly-supervised BioRE datasets, we make the following contributions. First, we present TBGA, a novel large-scale, semi-automatically annotated dataset for GDA extraction based on DisGeNET. We chose DisGeNET as the source database since it is one of the most comprehensive databases for GDAs [31], integrating several expert-curated resources, such as UniProt [1], CTD [3], and PsyGeNET [32]. Furthermore, DisGeNET spans several different types of GDAs, as opposed to other databases like OMIM [6], COSMIC [7], TTD [33], BioMuta and BioXpress [34], which only focus on specific GDA types. Specifically, we used the portion of DisGeNET with curated resources to make validation and test sets, whereas we used the rest for training. On the other hand, we generated false GDAs by selecting sentences where co-occurring genes and diseases do not participate in DisGeNET GDAs. Compared to the dataset developed by Bravo et al. [27], TBGA exploits DisGeNET (which is three orders of magnitude larger than GAD) to gather true GDAs as well as to generate false ones. Regarding the dataset by Nourani and Reshadat [28], TBGA fully exploits DisGeNET resources and does not limit itself to curated ones. In this way, all the available expert-curated resources can be used to build validation and test sets, making the produced dataset larger than previous attempts and more realistic than fully distantly-supervised datasets. As a side note, we do not compare TBGA to the fully distantly-supervised GDA dataset by Teng et al. [23] as the dataset is not publicly available.
To the best of our knowledge, TBGA is the largest available dataset for GDA extraction. Second, we trained and tested several state-of-the-art RE models on TBGA to create a large and realistic benchmark for GDA extraction. We built models using OpenNRE [35], an open and extensible toolkit for Neural Relation Extraction (NRE). The choice of OpenNRE makes it easier for future researchers to re-use the dataset and the models developed for this work. Finally, we publicly release TBGA on Zenodo [36], whereas we store source code and scripts to train and test RE models in a publicly available GitHub repository [37]. Besides, thanks to the continuous growth of DisGeNET, the released dataset can be updated and expanded regularly.

Results

TBGA is the first large-scale, semi-automatically annotated dataset for GDA extraction. The dataset consists of three text files, corresponding to train, validation, and test sets, plus an additional JSON file containing the mapping between relation names and IDs. Each record in train, validation, or test files corresponds to a single GDA extracted from a sentence, and it is represented as a JSON object with the following attributes:
text: sentence from which the GDA was extracted.
relation: relation name associated with the given GDA.
h: JSON object representing the gene entity, composed of:
    id: NCBI Entrez ID associated with the gene entity.
    name: NCBI official gene symbol associated with the gene entity.
    pos: list consisting of starting position and length of the gene mention within text.
t: JSON object representing the disease entity, composed of:
    id: UMLS Concept Unique Identifier (CUI) associated with the disease entity.
    name: UMLS preferred term associated with the disease entity.
    pos: list consisting of starting position and length of the disease mention within text.
If a sentence contains multiple gene-disease pairs, the corresponding GDAs are split into separate data records.
Overall, TBGA contains over 200,000 instances and 100,000 bags. Table 1 reports per-relation statistics for the dataset. Notice the large number of Not Associated (NA) instances. Moreover, Fig. 1 depicts the 20 most frequent genes, diseases, and GDAs within TBGA. The most frequent genes are tumor suppressor genes, such as TP53 and CDKN2A, and (proto-)oncogenes, like EGFR and BRAF. Among the most frequent diseases, we have neoplasms such as breast carcinoma, lung adenocarcinoma, and prostate carcinoma. As a consequence, the most frequent GDAs are gene-cancer associations.
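
As an illustration of the record layout described above, the following minimal sketch builds one record and recovers the entity mentions from their pos values. The sentence is the ADM example quoted in the Data validation section; the id values are placeholders rather than real Entrez or UMLS identifiers, the relation string is assumed to match the relation name, and the helper name mention is ours, not part of the dataset or of OpenNRE.

```python
import json


def mention(text, pos):
    """Return the substring addressed by pos = [start, length]."""
    start, length = pos
    return text[start:start + length]


text = ("These findings suggest the possible role of ADM and SEPX1 "
        "as biomarkers of schizophrenia.")

# Illustrative record; the id values are placeholders (assumptions).
record = {
    "text": text,
    "relation": "Biomarker",
    "h": {"id": "GENE_ID_PLACEHOLDER", "name": "ADM",
          "pos": [text.index("ADM"), len("ADM")]},
    "t": {"id": "CUI_PLACEHOLDER", "name": "Schizophrenia",
          "pos": [text.index("schizophrenia"), len("schizophrenia")]},
}

# Round-trip through JSON, as if the record had been read from a file.
parsed = json.loads(json.dumps(record))
gene_mention = mention(parsed["text"], parsed["h"]["pos"])
disease_mention = mention(parsed["text"], parsed["t"]["pos"])
```

Because pos stores a start offset and a length (not an end offset), the slice is text[start:start + length].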

Table 1

Per-relation statistics for TBGA

Granularity	Split	Therapeutic	Biomarker	Genomic alterations	NA
Sentence-level	Train	3139	20,145	32,831	122,149
	Validation	402	2279	2306	15,206
	Test	384	2315	2209	15,608
Bag-level	Train	2218	13,372	12,759	56,698
	Validation	331	2019	1147	6994
	Test	308	2068	1122	6996

Statistics are reported separately for each data split. Columns represent, from left to right, the considered granularity level, the data split, and the number of instances and bags associated with Therapeutic, Biomarker, Genomic Alterations, and NA relations

Fig. 1

The 20 most frequent genes, diseases, and GDAs within TBGA

TBGA is two orders of magnitude larger than currently available datasets for GDA extraction [13–15, 27, 28]. Moreover, TBGA focuses on different association types, whereas most current datasets only consider positive, negative, or false GDAs. The only exception is CoMAGC [15], where relations focus on different aspects of the gene expression changes and their association with cancer. Therefore, training and then testing RE models on TBGA allows for a more fine-grained and realistic evaluation that helps build effective solutions for GDA extraction. Table 2 compares global statistics between TBGA, EU-ADR [13], CoMAGC [15], PolySearch [14], GAD [27], and GDAE [28] datasets.

Table 2

Global statistics comparison between TBGA, EU-ADR [13], CoMAGC [15], PolySearch [14], GAD [27], and GDAE [28] datasets

Dataset	Annotation	Instances	Publications	Inst.s/pub.	Genes	Diseases	Relations
CoMAGC	Manual	821	408	2.01	538	3	15
EU-ADR	Manual	355	65	5.46	221	118	4
PolySearch	Manual	522	374	1.40	245	10	2
GAD	Weak	5329	4112	1.30	1139	535	3
GDAE	Weak	8000	5875	1.36	3635	1904	2
TBGA	Weak	218,973	134,059	1.63	11,784	9199	4

Columns represent, from left to right, the considered dataset, the type of annotation, the total number of instances and publications, the average number of instances per publication, as well as the total number of genes, diseases, and relations

On the other hand, compared to current large-scale, fully distantly-supervised BioRE datasets (i.e., BioRel [24] and DTI [10]), TBGA contains expert-curated data. Hence, TBGA represents a more accurate benchmark on which to train and test RE models than fully distantly-supervised datasets, helping to understand the current status and future steps required to improve BioRE research [26]. Despite the use of expert-curated data, TBGA has a size comparable to that of fully distantly-supervised BioRE datasets. Besides, with the continuous growth of DisGeNET, the size of TBGA can further increase. Table 3 compares global statistics between TBGA, DTI [10], and BioRel [24] datasets.

Table 3

Global statistics comparison between TBGA, BioRel [24], and DTI [10] datasets

Dataset	Split	Instances	Bags	Inst.s/bag	Relations
BioRel	Train	534,277	39,969	13.37	125
	Validation	114,506	20,675	5.54
	Test	114,565	20,756	5.52
DTI	Train	604,303	472,033	1.28	6
	Validation	6133	4769	1.29
	Test	6312	4817	1.31
TBGA	Train	178,264	85,047	2.10	4
	Validation	20,193	10,491	1.92
	Test	20,516	10,494	1.96

Statistics are reported separately for each data split. Columns represent, from left to right, the considered dataset, the data split, the total number of instances and bags, the average number of instances per bag, as well as the total number of relations
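
The TBGA rows of Table 3 can be cross-checked against the per-relation counts of Table 1. The short sketch below sums the Table 1 counts per split and recomputes the instances-per-bag ratio; all numbers are copied from the two tables, and the variable names are ours.

```python
# Per-relation counts from Table 1, in the order:
# Therapeutic, Biomarker, Genomic alterations, NA.
sentence_level = {
    "Train": [3139, 20145, 32831, 122149],
    "Validation": [402, 2279, 2306, 15206],
    "Test": [384, 2315, 2209, 15608],
}
bag_level = {
    "Train": [2218, 13372, 12759, 56698],
    "Validation": [331, 2019, 1147, 6994],
    "Test": [308, 2068, 1122, 6996],
}

# TBGA rows of Table 3: (instances, bags, instances per bag).
table3 = {
    "Train": (178264, 85047, 2.10),
    "Validation": (20193, 10491, 1.92),
    "Test": (20516, 10494, 1.96),
}

summary = {}
for split in sentence_level:
    instances = sum(sentence_level[split])
    bags = sum(bag_level[split])
    summary[split] = (instances, bags, round(instances / bags, 2))

# Total number of instances across splits (reported in Table 2).
total_instances = sum(row[0] for row in summary.values())
```

The recomputed values agree with Table 3, and the total over the three splits equals the 218,973 instances reported for TBGA in Table 2.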

Discussion

Data validation

In order to validate TBGA, we conducted comprehensive experiments with state-of-the-art RE models under the Multi-Instance Learning (MIL) setting. MIL is the typical setting used for distantly-supervised RE, where sentences are divided into bags based on pairs of entities and the prediction of relations occurs at bag-level. For example, the following two instances compose the “ADM-Schizophrenia” bag, where the target relation is Biomarker. Instance 1: “Our data support that ADM may be associated with the pathophysiology of schizophrenia, although the cause of the association needs further study.” Instance 2: “These findings suggest the possible role of ADM and SEPX1 as biomarkers of schizophrenia.” Below, we first describe the experimental setup and then present the results.
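
The bag construction can be sketched as a simple grouping step. The snippet below is a minimal illustration, assuming that a bag is identified by the pair of gene and disease identifiers; the identifiers G1, D1, G2, and D2 and the third sentence are made up for the example, and group_into_bags is a hypothetical helper, not an OpenNRE function.

```python
from collections import defaultdict


def group_into_bags(records):
    """Group sentence-level records into bags keyed by (gene id, disease id)."""
    bags = defaultdict(list)
    for rec in records:
        bags[(rec["h"]["id"], rec["t"]["id"])].append(rec)
    return dict(bags)


records = [
    {"text": "Our data support that ADM may be associated with the "
             "pathophysiology of schizophrenia, although the cause of the "
             "association needs further study.",
     "relation": "Biomarker", "h": {"id": "G1"}, "t": {"id": "D1"}},
    {"text": "These findings suggest the possible role of ADM and SEPX1 "
             "as biomarkers of schizophrenia.",
     "relation": "Biomarker", "h": {"id": "G1"}, "t": {"id": "D1"}},
    # Made-up record standing in for an unrelated gene-disease pair.
    {"text": "A placeholder sentence mentioning another gene and disease.",
     "relation": "NA", "h": {"id": "G2"}, "t": {"id": "D2"}},
]

bags = group_into_bags(records)
bag_sizes = {pair: len(items) for pair, items in bags.items()}
```

Here the two ADM sentences end up in one bag of size two, and the relation is then predicted once for the whole bag rather than once per sentence.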

Experimental setup

Datasets

We performed experiments on three different datasets: TBGA, DTI, and BioRel. We used TBGA as a benchmark to evaluate RE models for GDA extraction under the MIL setting. On the other hand, we used DTI and BioRel only to validate the soundness of our implementation of the baseline models.

Evaluation measures

We evaluated RE models using the Area Under the Precision-Recall Curve (AUPRC). AUPRC is a popular measure to evaluate distantly-supervised RE models, which has been adopted by OpenNRE [35] and used in several works, such as [10, 24]. For experiments on TBGA, we also computed Precision at k items (P@k) and plotted the precision-recall curves.
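
For readers unfamiliar with these measures, the sketch below shows one common way to compute them from a ranked list of predictions: precision and recall are evaluated at every rank, AUPRC is approximated with the trapezoidal rule over the resulting points, and P@k is the precision at rank k. This is an illustrative implementation under those assumptions, with function names of our choosing; it is not taken from the released code.

```python
def precision_recall_points(scored, total_positives):
    """scored: list of (score, is_correct) pairs, one per prediction.

    Returns the precision and recall values obtained at each rank after
    sorting predictions by decreasing score.
    """
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    precisions, recalls = [], []
    correct = 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        precisions.append(correct / rank)
        recalls.append(correct / total_positives)
    return precisions, recalls


def auprc(precisions, recalls):
    """Trapezoidal area under the precision-recall points."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2
    return area


def precision_at_k(precisions, k):
    """Precision among the k highest-scored predictions."""
    return precisions[k - 1]


# Toy ranking: four predictions, two of which are correct.
scored = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
precisions, recalls = precision_recall_points(scored, total_positives=2)
area = auprc(precisions, recalls)
p_at_2 = precision_at_k(precisions, 2)
```

Note that the trapezoidal rule only integrates between observed points, so the area to the left of the first recall value is not counted.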

Aggregation strategies

We adopted two different sentence aggregation strategies to use RE models under the MIL setting: average-based (AVE) and attention-based (ATT) [38]. The average-based aggregation assumes that all sentences within the same bag contribute equally to the bag-level representation. In other words, the bag representation is the average of all its sentence representations. On the other hand, the attention-based aggregation represents each bag as a weighted sum of its sentence representations, where the attention weights are dynamically adjusted for each sentence.
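
The difference between the two strategies can be sketched in a few lines. The example below is a simplification: attention scores are plain dot products between each sentence vector and a query vector standing in for a relation representation, followed by a softmax. The actual attention mechanism of [38] and of the implemented models may differ, and the function names are ours.

```python
import math


def average_bag(sentence_vecs):
    """AVE: every sentence contributes equally to the bag representation."""
    n = len(sentence_vecs)
    return [sum(column) / n for column in zip(*sentence_vecs)]


def attention_bag(sentence_vecs, query):
    """ATT: weighted sum of sentence vectors, weights from a softmax.

    The dot-product scoring against `query` is an assumption made for
    illustration only.
    """
    scores = [sum(s * q for s, q in zip(vec, query)) for vec in sentence_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(sentence_vecs[0])
    bag = [sum(w * vec[d] for w, vec in zip(weights, sentence_vecs))
           for d in range(dim)]
    return bag, weights


sentences = [[1.0, 0.0], [0.0, 1.0]]
ave = average_bag(sentences)
att, weights = attention_bag(sentences, query=[2.0, 0.0])
uniform_att, uniform_weights = attention_bag(sentences, query=[0.0, 0.0])
```

With an all-zero query every sentence gets the same weight, so the attention-based representation reduces to the average-based one; a query aligned with the first sentence shifts the weight towards it.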

Baseline models

We considered the main state-of-the-art RE models to perform experiments: CNN [39], PCNN [40], BiGRU [10, 24, 41], BiGRU-ATT [10, 42], and BERE [10]. A detailed description of these RE models, along with information on parameter settings and hyper-parameter tuning, can be found in Additional file 1.

Experimental results

We report the results for two different experiments. The first experiment aims to validate the soundness of the implementation of the considered RE models. To this end, we trained and tested the RE models on DTI and BioRel datasets, and we compared the AUPRC scores we obtained against those reported in the original works [10, 24]. For this experiment, we only compared the RE models and aggregation strategies that were used in the original works. The results and discussion of the experiment can be found in Additional file 2. The second experiment uses TBGA as a benchmark to evaluate RE models for GDA extraction. In this case, we trained and tested all the considered RE models using both aggregation strategies. For each RE model, we reported the AUPRC and P@k scores, and we plotted the precision-recall curve.

GDA benchmarking

Table 4 shows the AUPRC and P@k scores of RE models on TBGA, whereas Fig. 2 plots the corresponding precision-recall curves. Given the RE models' performance and precision-recall curves, we make the observations reported after Table 4 and Fig. 2. Overall, the obtained results suggest that TBGA is a challenging dataset for GDA extraction and, in general, for BioRE.

Table 4

RE models performance on TBGA dataset

Model	Strategy	AUPRC	P@50	P@100	P@250	P@500	P@1000
CNN	AVE	0.422	0.780	0.760	0.744	0.696	0.625
CNN	ATT	0.403	0.780	0.760	0.788	0.710	0.624
PCNN	AVE	0.426	0.780	0.780	0.744	0.720	0.664
PCNN	ATT	0.404	0.760	0.750	0.744	0.700	0.628
BiGRU	AVE	0.437	0.620	0.720	0.724	0.730	0.678
BiGRU	ATT	0.423	0.760	0.750	0.748	0.726	0.666
BiGRU-ATT	AVE	0.419	0.740	0.740	0.748	0.694	0.615
BiGRU-ATT	ATT	0.390	0.680	0.760	0.756	0.702	0.631
BERE	AVE	0.419	0.700	0.710	0.720	0.704	0.620
BERE	ATT	0.445	0.780	0.780	0.800	0.764	0.709

Columns represent, from left to right, the considered RE model, the aggregation strategy, the AUPRC score, as well as the P@50, P@100, P@250, P@500, and P@1000 scores. For each measure, bold values represent the best scores

Fig. 2

Precision-Recall curves for RE models on TBGA dataset. RE models are evaluated using both aggregation strategies, that is, average-based (AVE) and attention-based (ATT). Therefore, precision-recall curves are plotted for each aggregation strategy

The performance achieved by RE models on TBGA indicates a high complexity of the GDA extraction task. When recall is smaller than 0.1, all RE models have precision greater than 0.7. However, at higher recall values, model performance decreases sharply. In particular, when recall is greater than 0.4, no RE model achieves precision values greater than or equal to 0.5. The task complexity is further supported by the lower performance obtained by top-performing RE models on TBGA compared to DTI and BioRel (cf. Additional file 2: Table S2).
CNN, PCNN, BiGRU, and BiGRU-ATT RE models behave similarly. Among them, BiGRU-ATT has the worst performance. This suggests that replacing the BiGRU max pooling layer with an attention layer proves less effective. Overall, the best AUPRC and P@k scores are achieved by BERE when using the attention-based aggregation strategy. This highlights the effectiveness of fully exploiting sentence information from both semantic and syntactic aspects [10].
BERE's top performance can also be observed by looking at its precision-recall curve, which remains constantly above the other curves up to recall 0.4, where it aligns with the others. Nevertheless, most RE models, regardless of the considered aggregation strategy, show precision drops at early recall values, not greater than 0.4.
In terms of AUPRC, the attention-based aggregation proves less effective than the average-based one. On the other hand, attention-based aggregation provides mixed results on P@k measures. Although in contrast with the results obtained in general-domain RE [38], this trend is in line with the results found by Xing et al. [24] on BioRel, where RE models using an average-based aggregation strategy achieve performance comparable to or higher than those using an attention-based one. The only exception is BERE, whose performance with the attention-based aggregation is higher than with the average-based strategy.
Re-use potential

TBGA complies with the format required by OpenNRE [35] to train and test RE models. We chose to structure the dataset in this way to ease its re-use by future researchers. OpenNRE already provides several RE models that can be used directly on TBGA. In addition, we have used OpenNRE to implement widely used RE models that the toolkit lacked. We used TBGA as a benchmark to evaluate RE models under the MIL setting, which is the typical setting for the RE task under distant supervision. In other words, we trained and tested RE models at bag-level. However, TBGA contains sentence-level expert-curated annotations in validation and test sets. Thus, researchers can also use TBGA to train RE models at bag-level and evaluate them on sentence-level expert-curated data, which is an emerging setting for distantly-supervised, manually enhanced datasets [25, 26]. No format changes are required to make TBGA compliant with this alternative setting.

Conclusions

We have presented a large-scale, semi-automatically annotated dataset for Gene-Disease Association (GDA) extraction. Automatic GDA extraction is one of the most relevant tasks of BioRE. We have used TBGA as a benchmark to evaluate state-of-the-art BioRE models on GDA extraction. The results suggest that TBGA is a challenging dataset for this task. Besides, the large size of TBGA—along with the presence of expert-curated annotations in its validation and test sets—makes it more realistic than fully distantly-supervised BioRE datasets.

Methods

The process to create TBGA consisted of four steps: data acquisition, data cleaning, distant supervision, and dataset generation. Figure 3 illustrates the overall procedure.

Fig. 3

Overview of the TBGA creation process. The process consists of four steps: (1) data acquisition; (2) data cleaning; (3) distant supervision; and (4) dataset generation

Data acquisition

The data used to generate TBGA comes from DisGeNET [12]. DisGeNET collects data on genotype-phenotype relationships from several resources and covers most human diseases, including Mendelian, complex, environmental and rare diseases, as well as disease-related traits. According to the type of resource, DisGeNET organizes gene-disease data into one of four categories: Curated, Animal Models, Inferred, and Literature. Curated data contains GDAs provided by expert-curated resources; Animal Models data includes GDAs from resources containing information about rat and mouse models of disease; Inferred data refers to GDAs inferred from the Human Phenotype Ontology (HPO) [43] and from Variant-Disease Associations (VDAs); and Literature data provides GDAs extracted from the scientific literature using text-mining techniques [27, 44, 45]. For a seamless integration of such GDAs, DisGeNET classifies them by different association types, which are defined in the DisGeNET association type ontology. A detailed description of each association type can be found on the DisGeNET platform [46]. Figure 4 depicts the DisGeNET association type ontology, where we also report the Semanticscience Integrated Ontology (SIO) [47] identifiers of the different association types.

Fig. 4

DisGeNET association type ontology. For each association type, we also report its SIO identifier

We acquired data from DisGeNET v7.0 to build TBGA. This version of DisGeNET contains 1,134,942 GDAs, involving 21,671 genes and 30,170 diseases, disorders, traits, and clinical or abnormal human phenotypes. We accessed DisGeNET data through the web interface [46], where we used the Browse functionality to retrieve GDAs along with supporting evidence. We gathered data from all four resource categories. Moreover, we filtered out data with no PubMed IDentifier (PMID) to avoid retrieving GDAs without a sentence supporting the association.

Data cleaning

The data acquired from DisGeNET underwent a data cleaning process. First, we filtered data based on the presence of tags surrounding the gene and disease mentions within sentences. In other words, we restricted the data to GDAs having representative sentences where both the gene and the disease are highlighted. Then, we stripped gene and disease tags from the text and we stored the exact location of gene and disease mentions within sentences. Since DisGeNET integrates data from various resources, there might be duplicate evidence for the same GDA. In this case, we discarded duplicates and prioritized data coming from expert-curated resources. From each instance resulting from the data cleaning process, we considered the following attributes: the original source, the publication supporting the association, the representative sentence, the association type, as well as information on the gene and disease involved in the association. Regarding genes, we kept the NCBI Entrez [48] identifiers, the NCBI official gene symbols, and the gene locations within sentences. As for diseases, we stored the UMLS [49] CUIs, the UMLS preferred terms, and the disease locations in text.
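
The tag-based selection and tag stripping steps can be sketched as follows. The markup used here (<gene>...</gene> and <disease>...</disease>) is hypothetical, since the exact tags used by DisGeNET are not described in this section, and the example sentence and the function names are ours. Locations are stored as a start offset and a length, matching the pos attribute of TBGA records.

```python
import re

# Hypothetical tag syntax; the real DisGeNET markup may differ.
TAG = re.compile(r"<(gene|disease)>(.*?)</\1>")


def strip_tags(tagged):
    """Remove entity tags and record [start, length] of the first mention
    of each entity type in the resulting plain text."""
    parts = []
    positions = {}
    cursor = 0    # position in the tagged string
    out_len = 0   # length of the plain text built so far
    for match in TAG.finditer(tagged):
        before = tagged[cursor:match.start()]
        parts.append(before)
        out_len += len(before)
        label, surface = match.group(1), match.group(2)
        positions.setdefault(label, [out_len, len(surface)])
        parts.append(surface)
        out_len += len(surface)
        cursor = match.end()
    parts.append(tagged[cursor:])
    return "".join(parts), positions


def clean(tagged):
    """Keep a sentence only if both the gene and the disease are tagged."""
    plain, positions = strip_tags(tagged)
    if "gene" not in positions or "disease" not in positions:
        return None
    return {"text": plain, "h_pos": positions["gene"],
            "t_pos": positions["disease"]}


tagged = ("Mutations in <gene>TP53</gene> are frequent in "
          "<disease>breast carcinoma</disease>.")
cleaned = clean(tagged)
rejected = clean("Mutations in <gene>TP53</gene> are frequent.")
```

Offsets are computed against the plain text being built, not the tagged string, so they stay valid after the tags are removed.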

Distant supervision

To effectively train RE models, false GDAs are also required, i.e., instances where co-occurring genes and diseases are not semantically associated. However, DisGeNET stores only true GDAs. To overcome this limitation, we used distant supervision [16] to obtain false GDAs from the sentences contained within the abstract or title of the PubMed articles that support the GDAs retrieved in the data acquisition process. To this end, we relied on the 3.6.2rc6 version of MetaMapLite [50], a near real-time NER tool that identifies UMLS concepts within biomedical text. MetaMapLite returns, among other information, the CUI, the preferred term, and the location in text of the identified UMLS concepts. Thus, we used MetaMapLite to identify gene and disease UMLS concepts within sentences. For each identified concept, we stored its CUI, preferred term, and location in text. Then, we performed the following steps to generate false GDAs.
(1) We restricted the search to sentences where the co-occurring genes and diseases come from DisGeNET. The search for false GDAs among the genes and diseases of DisGeNET aimed to reduce false negatives and to obtain gene-disease pairs that were more likely not to be semantically associated.
(2) We filtered out instances where gene mentions matched common words. For instance, when all letters are in uppercase, the words FOR and TYPE are, by convention [51], aliases for the WWOX and SGCG genes. Therefore, when the gene mentions identified by MetaMapLite matched such (and other) common words, we kept the corresponding instances only if the matched words were in uppercase. As common words, we considered the set of most frequent words provided by Peter Norvig [52], which were derived from the Google Web Trillion Word Corpus [53].
(3) We used the 2020AA UMLS MRCONSO file [54] to build a disease dictionary that stored UMLS preferred terms, lexical variants, alternate forms, short forms, and synonyms of the DisGeNET diseases. The MRCONSO file contains one row for each occurrence of each unique string or concept name within each source vocabulary of the UMLS Metathesaurus. We only kept instances where disease mentions exact-matched dictionary terms. In this way, we removed partial matches identified by MetaMapLite and, as a consequence, we reduced erroneous disease mentions.
(4) Of the remaining instances, we only took those whose gene-disease pairs did not belong to any GDA within DisGeNET and we labeled them as NA.
For each instance generated through distant supervision, we kept the following attributes: the publication and sentence from which the false GDA has been extracted, the NA association type, and information on the co-occurring gene and disease. For genes, we first mapped UMLS CUIs to NCBI Entrez IDs, and then we stored them together with NCBI official gene symbols and gene locations in text. On the other hand, for diseases, we stored UMLS CUIs, UMLS preferred terms, as well as disease locations in text.
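
The filtering logic used to generate false GDAs can be summarized in a short sketch. All inputs below (identifier sets, the common-word list, the disease dictionary, and the candidate field names) are toy stand-ins that we introduce for illustration; in particular, comparing disease mentions after lowercasing is our assumption about how exact matching is applied.

```python
def keep_gene_mention(surface, common_words):
    """A gene mention that collides with a common word is kept only when
    it is written in uppercase (e.g. FOR as an alias, not the word 'for')."""
    if surface.lower() in common_words:
        return surface.isupper()
    return True


def label_false_gdas(candidates, disgenet_genes, disgenet_diseases,
                     known_pairs, common_words, disease_terms):
    """Return the candidates that survive all filters, labeled as NA."""
    kept = []
    for cand in candidates:
        pair = (cand["gene_id"], cand["disease_id"])
        # Both entities must come from DisGeNET.
        if pair[0] not in disgenet_genes or pair[1] not in disgenet_diseases:
            continue
        # Drop gene mentions that are lowercase common words.
        if not keep_gene_mention(cand["gene_mention"], common_words):
            continue
        # Disease mention must exactly match a dictionary term.
        if cand["disease_mention"].lower() not in disease_terms:
            continue
        # The pair must not belong to any known GDA.
        if pair in known_pairs:
            continue
        kept.append({**cand, "relation": "NA"})
    return kept


candidates = [
    {"gene_id": "G1", "disease_id": "D1",
     "gene_mention": "FOR", "disease_mention": "breast carcinoma"},
    {"gene_id": "G1", "disease_id": "D1",
     "gene_mention": "for", "disease_mention": "breast carcinoma"},
    {"gene_id": "G2", "disease_id": "D1",
     "gene_mention": "TP53", "disease_mention": "breast carcinoma"},
    {"gene_id": "G1", "disease_id": "D2",
     "gene_mention": "FOR", "disease_mention": "carcinoma of"},
]
false_gdas = label_false_gdas(
    candidates,
    disgenet_genes={"G1", "G2"},
    disgenet_diseases={"D1", "D2"},
    known_pairs={("G2", "D1")},
    common_words={"for", "type"},
    disease_terms={"breast carcinoma"},
)
```

Only the first candidate survives: the second is a lowercase common word, the third is a known GDA, and the fourth is a partial disease match.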

Dataset generation

The sets of true and false instances obtained from the data cleaning and distant supervision processes were used to generate TBGA. We considered different associations from the DisGeNET association type ontology to build the dataset. Specifically, we adopted the Therapeutic, Biomarker, and Genomic Alterations association types as relations. Instead, we did not consider the Altered Expression and Post-translational Modification association types, although they are at the same level as Genomic Alterations, as we lacked curated data for them. In addition to true associations, we also considered the false association NA. The steps required to create TBGA are described after Table 5. We provide statistics regarding the different steps of data cleaning and dataset generation for true instances in Table 5. As for NA statistics, we performed distant supervision on more than 700,000 publications, obtaining 152,963 instances and 70,688 bags, which are associated with 83,501 publications and involve 9167 different genes and 5151 different diseases.

Table 5

Global and per-relation statistics for data cleaning and dataset generation

Granularity	Target	Raw	Data cleaning: TS	Data cleaning: DR	Dataset generation: RN	Dataset generation: DB
Global	Publications	707,390	572,981	572,607	447,280	57,675
Global	Genes	21,118	17,658	17,658	17,658	8827
Global	Diseases	23,433	17,032	17,023	17,023	6964
Therapeutic	Instances	10,744	4132	3925	3925	3925
Therapeutic	Bags	6872	2939	2857	2857	2857
Biomarker	Instances	1,530,072	1,080,089	1,075,327	580,053	24,739
Biomarker	Bags	605,826	460,334	460,276	383,358	17,459
Genomic Alterations	Instances	849,472	531,601	516,630	516,630	37,346
Genomic Alterations	Bags	289,693	202,548	202,045	202,045	15,028

Columns represent, from left to right, the considered granularity level, the target item, the raw (initial) statistics, and the statistics after each Data Cleaning and Dataset Generation step. The steps are: TS, DR, RN, and DB

We performed a normalization process to convert DisGeNET association types to TBGA relations. In this regard, given the hierarchical structure of the DisGeNET association type ontology, we could normalize finer association types to their coarser ancestors. For instance, a Genetic Variation association is also a Genomic Alterations one, which, in turn, is a Biomarker association (cf. Fig. 4). Thus, we mapped association types finer than Genomic Alterations to Genomic Alterations itself. On the other hand, instances involving the same gene-disease pair from the same sentence can have Biomarker or Genomic Alterations association types depending on the considered resource. This situation occurs because instances are generated by different biologists or using different text-mining techniques. In these cases, we removed the instances associated with Biomarker and kept those associated with Genomic Alterations, which is a finer, and thus more precise, association type than Biomarker.

We divided true instances among training, validation, and test sets based on the resource category. We used Curated data for validation and test, and Animal Models, Inferred, and Literature data for training. The only exception was Therapeutic, for which we lacked enough data for training. In this case, we also used Curated data for training, setting an 80/10/10 ratio among training, validation, and test sets.

We balanced the number of true instances among the dataset relations. For Biomarker and Genomic Alterations, we split Curated data evenly between validation and test. Then, we kept in training the same ratio among relations that exists in the validation and test sets. Since we model the BioRE task as a MIL problem, we downsampled over-represented relations (i.e., Biomarker and Genomic Alterations) at the bag level rather than at the sentence level to obtain the desired ratio among relations.

We want TBGA to reflect the sparseness of GDAs in biomedical literature. If we randomly sample gene and disease mentions from a sentence of a given scientific article, it is very likely that no association occurs between them. Therefore, similar to previous works [10, 24], we included a large number of false instances in the training, validation, and test sets to make TBGA sparse. For each set, we sampled a number of NA bags twice the number of bags associated with true relations.

We removed from the training set the bags whose gene-disease pairs also belong to the validation or test sets. This operation avoids introducing bias at inference time, as RE models cannot exploit training knowledge on the gene-disease pair.

Additional file 1. BioRE models description and settings. Detailed description of the considered RE models, along with information on parameter settings and hyper-parameter tuning.

Additional file 2. Baselines validation. Results and discussion of the experiment performed to validate the soundness of the implementation of the considered RE models.
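The dataset generation steps described in this section (relation normalization, bag-level downsampling, NA sampling, and removal of overlapping training bags) can be illustrated with the following Python sketch. It is a simplified reading of the text, not the authors' code: the `PARENT` map covers only the ontology fragment mentioned above, relation names are written without spaces, and all function names and data layouts are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical fragment of the association type ontology (child -> parent).
PARENT = {
    "GeneticVariation": "GenomicAlterations",
    "GenomicAlterations": "Biomarker",
}
TBGA_RELATIONS = {"Therapeutic", "Biomarker", "GenomicAlterations"}


def normalize(assoc_type):
    """Map an association type to its closest ancestor that is a TBGA relation."""
    current = assoc_type
    while current is not None and current not in TBGA_RELATIONS:
        current = PARENT.get(current)
    return current  # None when the type is outside the considered relations


def resolve_conflicts(instances):
    """Drop Biomarker when the same (gene, disease, sentence) is also
    annotated as GenomicAlterations. Items are (gene, disease, sentence, relation)."""
    by_key = defaultdict(set)
    for gene, disease, sentence, relation in instances:
        by_key[(gene, disease, sentence)].add(relation)
    resolved = []
    for (gene, disease, sentence), relations in by_key.items():
        if {"Biomarker", "GenomicAlterations"} <= relations:
            relations = relations - {"Biomarker"}
        for relation in sorted(relations):
            resolved.append((gene, disease, sentence, relation))
    return resolved


def downsample_bags(bags, target, rng):
    """Keep at most `target` whole bags; sentences inside a bag are never split.
    `bags` maps a (gene, disease) pair to its list of sentences."""
    keys = sorted(bags)
    if len(keys) <= target:
        return dict(bags)
    return {key: bags[key] for key in rng.sample(keys, target)}


def sample_na_bags(na_bags, n_true_bags, rng):
    """Sample twice as many NA bags as there are bags with true relations."""
    return downsample_bags(na_bags, 2 * n_true_bags, rng)


def remove_overlap(train_bags, heldout_pairs):
    """Remove training bags whose gene-disease pair occurs in validation or test."""
    return {pair: s for pair, s in train_bags.items() if pair not in heldout_pairs}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_bags = {("G%d" % i, "D1"): ["sentence %d" % i] for i in range(6)}
na_bags = {("G%d" % i, "D2"): ["sentence %d" % i] for i in range(20)}
balanced = downsample_bags(true_bags, 4, rng)
na_sample = sample_na_bags(na_bags, len(balanced), rng)
train = remove_overlap(balanced, heldout_pairs={("G0", "D1")})
```

As a consistency check on the reported figures, the DB column of Table 5 lists 2857 + 17,459 + 15,028 = 35,344 bags with true relations, and twice that number is 70,688, which equals the number of NA bags reported in the NA statistics.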
