Jehad Aldahdooh, Markus Vähä-Koskela, Jing Tang, Ziaurrehman Tanoli.
Abstract
BACKGROUND: Drug-target interactions (DTIs) are critical for drug repurposing and for elucidating drug mechanisms, and they are manually curated into large databases such as ChEMBL, BindingDB, DrugBank and DrugTargetCommons. However, the number of curated articles likely constitutes only a fraction of all articles that contain experimentally determined DTIs. Finding such articles and extracting the experimental information is a challenging task, and there is a pressing need for systematic approaches to assist DTI curation. To this end, we applied Bidirectional Encoder Representations from Transformers (BERT) to identify such articles. Because DTI data depends intimately on the type of assay used to generate it, we also aimed to incorporate functions to predict the assay format.
Keywords: BERT; BERT for biomedical data; Bidirectional encoder representations from transformers; Bioactivity data; Biomedical text mining; Drug repurposing; Drug target interaction prediction; Mining drug target interactions
Year: 2022 PMID: 35729494 PMCID: PMC9214985 DOI: 10.1186/s12859-022-04768-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Architecture shared by all the BERT models, where Wi represents the i-th input word token and Oi the contextual embedding at the output layer. O[CLS], the first token of the output sequence, carries the class label
Prediction of drug-target-like documents from PubMed articles. The third column shows the number of documents that contain either drug or protein entities as identified by PubTator, while the fourth column shows the number of documents that contain both drug and protein entities
| BERT model | Predicted as drug-target articles | Articles containing drugs or proteins on PubTator | Articles containing both drugs and proteins on PubTator |
|---|---|---|---|
| BERT | 688,206 | 682,150 | 342,902 |
| SciBERT | 594,999 | 589,999 | 321,831 |
| BioBERT | 636,091 | 630,132 | 340,638 |
| BioMed-RoBERTa | 725,748 | 720,030 | |
| BlueBERT | 570,284 | 564,220 | 297,834 |
| Majority voting | 597,844 | 592,789 | 316,794 |
Bold value indicates the top result for a dataset
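The "Majority voting" row combines the per-article predictions of the five BERT variants by taking the most common label. A minimal sketch of that aggregation step in plain Python (the labels and PMIDs below are illustrative, not the authors' code):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the per-model predictions for one article."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical labels from the five BERT variants for two articles
article_preds = {
    "PMID:0000001": ["DTI", "DTI", "other", "DTI", "other"],
    "PMID:0000002": ["other", "other", "other", "DTI", "other"],
}
voted = {pmid: majority_vote(preds) for pmid, preds in article_preds.items()}
```

With an odd number of voters and a binary label, ties cannot occur, which is one practical reason to ensemble five models rather than four.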
Fig. 2 Workflow for identifying new articles containing drug-target bioactivity data
Accuracy of BERT models on three independent datasets. DrugProt contains 2788 positive articles based on DTIs (positive class) and 1215 negative articles. Medline is a completely negative dataset, and ChEMBL is a completely positive dataset containing DTIs
| Dataset | Articles | BERT | SciBERT | BioBERT | BioMed-RoBERTa | BlueBERT | Majority voting |
|---|---|---|---|---|---|---|---|
| DrugProt | 4003 | 68 | 65.9 | 71.4 | 67.5 | 69.6 | |
| Medline | 55,056 | 99.7 | 98.6 | 75.2 | 99.9 | 100 | 100 |
| ChEMBL | 876 | 89.6 | 93.2 | 91.2 | 83.4 | 88.7 | 90.3 |
Bold values indicate the top results for a dataset
Fig. 3 Top word frequencies for (A) drug-target documents and (B) other biological documents
Fig. 4 (A) Top 15 journals among the articles predicted as drug-target articles; (B) top 15 publication years of articles predicted as drug-target articles
Tenfold cross-validation results for identifying assay formats
| BERT model | F1 macro | F1 micro |
|---|---|---|
| BERT | 81.0 ± 1 | 81.5 ± 1 |
| SciBERT | 85.2 ± 1.1 | 85.6 ± 1.1 |
| BioBERT | 86 ± 1.3 | 86.4 ± 1.3 |
| BioMed-RoBERTa | ||
| BlueBERT | 87.0 ± 1 | 87.5 ± 1 |
Bold values indicate the top results for a dataset
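The macro and micro F1 scores reported above differ only in how per-class results are aggregated: macro-F1 averages the F1 score of each assay-format class equally, while micro-F1 pools true positives, false positives and false negatives across all classes before computing a single score. A self-contained sketch (the class labels are illustrative):

```python
from collections import defaultdict

def f1_scores(y_true, y_pred):
    """Compute macro- and micro-averaged F1 over all classes present in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(labels)
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN)  # equals accuracy for single-label tasks
    return macro, micro

# Illustrative assay-format labels
true = ["cell-based", "biochemical", "cell-based", "organism"]
pred = ["cell-based", "cell-based", "cell-based", "organism"]
macro, micro = f1_scores(true, pred)
```

Because micro-F1 weights each article equally, it tracks the frequent classes; macro-F1 is more sensitive to rare assay formats, which is why the two columns in the table are close but not identical.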