| Literature DB >> 34043624 |
Chun-Nan Hsu, Chia-Hui Chang, Thamolwan Poopradubsil, Amanda Lo, Karen A. William, Ko-Wei Lin, Anita Bandrowski, Ibrahim Burak Ozyurt, Jeffrey S. Grethe, Maryann E. Martone.
Abstract
Antibodies are widely used reagents to test for expression of proteins and other antigens. However, they do not always produce reliable results: an antibody may fail to bind specifically to the target protein its provider designed it for, leading to unreliable research findings. While many proposals have been developed to deal with the problem of antibody specificity, it remains challenging to cover the millions of antibodies available to researchers. In this study, we investigate the feasibility of automatically alerting users to problematic antibodies by extracting statements about antibody specificity reported in the literature. The extracted alerts can be used to construct an "Antibody Watch" knowledge base containing supporting statements about problematic antibodies. We developed a deep neural network system and tested its performance on a corpus of more than two thousand articles that reported uses of antibodies. We divided the problem into two tasks. Given an input article, the first task is to identify snippets about antibody specificity and to classify whether each snippet reports that an antibody exhibits non-specificity and is thus problematic. The second task is to link each such snippet to one or more antibodies mentioned in it. The experimental evaluation shows that our system performs the classification task with a 0.925 weighted F1-score and the linking task with 0.962 accuracy, and achieves a 0.914 weighted F1 on the combined joint task. We leveraged Research Resource Identifiers (RRIDs) to precisely identify the antibodies linked to the extracted specificity snippets. These results show that it is feasible to construct a reliable knowledge base of problematic antibodies by text mining.
Year: 2021 PMID: 34043624 PMCID: PMC8189493 DOI: 10.1371/journal.pcbi.1008967
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Example snippets of the antibody specificity classes and the PubMed IDs (PMID) of their sources.
| Class | Example | PMID |
|---|---|---|
| | | 30177812 |
| | | 25650666 |
| | | 27335636 |
Fig 1. The workflow to construct Antibody Watch.
Given the article PMC6120938, a set of “RRID mention snippets” and “Specificity mention snippets” will be extracted. Next, a “Specificity classifier” will determine if a specificity mention snippet states that the antibody, in this case “the 6E10 antibody,” is specific to its target antigen or not. Then, “Antibody RRID linking” will link each specificity snippet to the “RRID mention snippets,” and thus to one or more exact antibodies. In this example, “the 6E10 antibody” is linked to the antibody with “RRID:AB_2564652,” which uniquely identifies an antibody. Finally, an entry is generated and entered into the Antibody Watch knowledge base to alert scientists that this antibody was reported to be nonspecific in PMC6120938 (PMID 30177812).
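The workflow described above can be sketched as a simple pipeline. This is a minimal illustration only: `classify_specificity` stands in for the trained (ABSA)2 classifier with a toy keyword rule, and the function names and the RRID regex are assumptions, not the authors' implementation.

```python
import re

# Hypothetical sketch of the Antibody Watch workflow in Fig 1.
# The real system uses trained neural models for classification and linking.

RRID_PATTERN = re.compile(r"RRID:\s*AB_\d+")  # antibody RRIDs look like RRID:AB_xxxxxx

def extract_rrid_mentions(text):
    """Find antibody RRID mention snippets in the article text."""
    return RRID_PATTERN.findall(text)

def classify_specificity(snippet):
    """Toy stand-in for the (ABSA)2 classifier: returns one of
    'nonspecific', 'neutral', or 'specific'."""
    lowered = snippet.lower()
    if "nonspecific" in lowered or "not specific" in lowered:
        return "nonspecific"
    if "specific" in lowered:
        return "specific"
    return "neutral"

def build_watch_entries(article_id, specificity_snippet, article_text):
    """Link a specificity snippet to the RRIDs mentioned in the article
    and emit one knowledge-base entry per linked antibody."""
    label = classify_specificity(specificity_snippet)
    return [
        {"article": article_id, "rrid": rrid, "label": label,
         "evidence": specificity_snippet}
        for rrid in extract_rrid_mentions(article_text)
    ]

entries = build_watch_entries(
    "PMC6120938",
    "the 6E10 antibody was found to be nonspecific",
    "Amyloid-beta was probed with 6E10 (BioLegend, RRID:AB_2564652).",
)
```

In the paper's actual pipeline, the linking step is itself a learned snippet-pair classifier rather than a simple co-occurrence lookup as sketched here.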
Fig 2. A schematic diagram of the neural network architecture of (ABSA)2 for classifying antibody specificity snippets.
(ABSA)2 is an attention-over-attention model (AOA) based on ABSA [18] but with a transformer (left) as its input word embedding layer. (ABSA)2 takes a snippet and an aspect token “antibody” as the input to classify the snippet into one of the specificity classes.
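The attention-over-attention mechanism at the core of (ABSA)2 can be illustrated with a minimal numpy sketch. This is a generic AOA computation under stated assumptions, not the authors' code: the real model feeds SciBERT contextual embeddings into this step and adds a learned classification head over the attended representation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_over_attention(H_s, H_a):
    """Attention-over-attention (AOA): H_s holds contextual embeddings of
    the snippet tokens (n x d); H_a those of the aspect tokens, here the
    single token "antibody" (m x d). Returns an attended snippet vector."""
    I = H_s @ H_a.T                  # (n, m) pairwise token interaction scores
    alpha = softmax(I, axis=0)       # column-wise: snippet attention per aspect token
    beta = softmax(I, axis=1)        # row-wise: aspect attention per snippet token
    beta_avg = beta.mean(axis=0)     # (m,) averaged aspect-level attention
    gamma = alpha @ beta_avg         # (n,) final attention over snippet tokens
    return H_s.T @ gamma             # (d,) attention-weighted snippet representation

rng = np.random.default_rng(0)
snippet_emb = rng.standard_normal((12, 8))  # 12 snippet tokens, embedding dim 8
aspect_emb = rng.standard_normal((1, 8))    # the aspect token "antibody"
rep = attention_over_attention(snippet_emb, aspect_emb)
```

A linear layer over `rep` would then produce scores for the three specificity classes (nonspecific, neutral, specific).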
Statistics of the annotated data.
We annotated 2,639 snippets for specificity classification (Task 1) and 7,245 snippet pairs for RRID linking (Task 2). Of the 7,245 pairs, 1,100 are linked. The specificity class distribution of the linked pairs is shown in the Joint row.
| Task (unit) | Label counts | Total |
|---|---|---|
| Task 1: Specificity Classification (snippets) | Nonspecific: 266, Neutral: 263, Specific: 2,110 | 2,639 |
| Task 2: RRID Linking (snippet pairs) | Yes: 1,100, No: 6,145 | 7,245 |
| Joint (RRID-specificity-snippet triples) | Nonspecific: 87, Neutral: 76, Specific: 937 | 1,100 |
Specificity classification performance comparison.
| Model | Specific | Non-specific | Neutral | Macro F1 | Weighted F1 |
|---|---|---|---|---|---|
| (ABSA)2 Models (Ours) | |||||
| AOA-SciBERT | 0.956 | 0.830 | 0.748 | 0.845 | 0.921 |
| AOA-BiLSTM-SciBERT | 0.954 | 0.820 | 0.734 | 0.836 | 0.918 |
| AOA-BiLSTM-CLS-SciBERT | 0.956 | 0.794 | | 0.839 | 0.921 |
| AOA-MHA-SciBERT | 0.954 | 0.798 | 0.746 | 0.833 | 0.917 |
| AOA-MHA-CLS-SciBERT | 0.944 | 0.618 | 0.718 | 0.760 | 0.909 |
| SciBERT | 0.908 | 0.640 | 0.274 | 0.607 | 0.819 |
| SciBERT-SPC | 0.954 | 0.824 | 0.756 | 0.845 | 0.924 |
| BioBERT-SPC | 0.946 | 0.800 | 0.732 | 0.826 | 0.912 |
| AOA-BioBERT | 0.956 | 0.796 | 0.746 | 0.830 | 0.918 |
| AEN-SciBERT | | 0.826 | 0.728 | 0.837 | 0.921 |
| AEN-CLS-SciBERT | 0.956 | 0.812 | 0.740 | 0.836 | 0.920 |
| LCF-SciBERT | 0.952 | 0.812 | 0.724 | 0.829 | 0.916 |
| LCF-CLS-SciBERT | 0.956 | 0.826 | 0.746 | 0.843 | 0.922 |
Numbers are F1 scores; bold indicates the best result in each column. Macro F1 is the unweighted average of the per-class F1 scores. Weighted F1 averages the per-class F1 scores weighted by the number of instances in each class.
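The two averages in the footnote can be made concrete. Below, the per-class F1 values are taken from the AOA-SciBERT row and the class counts from the Task 1 label counts in the data-statistics table; recomputing from these rounded inputs reproduces the macro F1 exactly and the weighted F1 to within rounding (the table's 0.921 was presumably computed from unrounded per-class values).

```python
# Macro F1 averages per-class F1 equally; weighted F1 weights each class
# by its number of instances.
f1 = {"Specific": 0.956, "Non-specific": 0.830, "Neutral": 0.748}  # AOA-SciBERT row
n = {"Specific": 2110, "Non-specific": 266, "Neutral": 263}        # Task 1 counts

macro_f1 = sum(f1.values()) / len(f1)                          # ≈ 0.845
weighted_f1 = sum(f1[c] * n[c] for c in f1) / sum(n.values())  # ≈ 0.92
```

The gap between the two averages reflects the heavy class imbalance: the dominant Specific class pulls the weighted average toward its high F1.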
RRID-linking performance comparison.
| Model | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|
| BERT-SPC | 0.830 | 0.859 | 0.844 | 0.952 |
| BioBERT-SPC | 0.844 | 0.874 | 0.858 | 0.956 |
| Baseline (0.8) | 0.579 | 0.633 | 0.605 | 0.483 |
| Baseline (0.9) | 0.591 | 0.665 | 0.626 | 0.600 |
| Baseline (1.0) | 0.603 | 0.671 | 0.635 | 0.689 |
| BiLSTM+Manhattan | 0.502 | 0.506 | 0.504 | 0.696 |
| BiLSTM+Euclidean | 0.522 | 0.536 | 0.529 | 0.664 |
Numbers in bold fonts are the best results.
Complete workflow performance.
| Class | Truth | Predicted | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Nonspecific | 87 | 101 | 0.802 | 0.931 | 0.862 |
| Neutral | 76 | 81 | 0.728 | 0.776 | 0.752 |
| Specific | 937 | 924 | 0.938 | 0.925 | 0.932 |
| Total/Macro | 1,100 | 1,106 | 0.823 | 0.878 | |
| Weighted | | | 0.913 | 0.915 | 0.914 |
The second-to-last row shows the totals for the ground-truth (Truth) and predicted (Predicted) counts and the macro averages of the metrics. The last row shows the weighted averages, as defined in the footnote of Table 3.
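The summary rows of the complete-workflow table can be checked arithmetically. This sketch recomputes the macro and weighted averages from the three per-class rows, using the Truth counts as weights; small differences from the table's macro values stem from the table's rounded per-class entries.

```python
# Recompute the summary rows of the complete-workflow table.
rows = {  # class: (truth_count, precision, recall, f1)
    "Nonspecific": (87, 0.802, 0.931, 0.862),
    "Neutral":     (76, 0.728, 0.776, 0.752),
    "Specific":    (937, 0.938, 0.925, 0.932),
}
total = sum(t for t, *_ in rows.values())  # 1,100 ground-truth triples
macro = [sum(m[i] for _, *m in rows.values()) / len(rows) for i in range(3)]
weighted = [sum(t * m[i] for t, *m in rows.values()) / total for i in range(3)]
# weighted ≈ [0.913, 0.915, 0.914], matching the table's weighted row and
# the 0.914 joint weighted F1 reported in the abstract.
```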