Qingyu Chen, Nagesh C. Panyam, Aparna Elangovan, Karin Verspoor.
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging the knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings a community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease.
We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks: the best existing mutation recognition tool achieves only 55% recall on the document triage training set, while relation extraction performance is limited by the low recall of gene entity recognition. We develop our models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study; for relation extraction, we employ an ensemble of BioNLP tools.
Our best document triage model achieves an F-score of 66.77%, while our best relation extraction model achieves an F-score of 35.09% over the final (updated post-task) test set. The characteristics of mutation mentions differ statistically between the training and test sets, which impacted the document triage task.
While this task marks a vital new direction for biomedical text mining research, our early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
Year: 2018 PMID: 30576491 PMCID: PMC6301335 DOI: 10.1093/database/bay122
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1 (a) Distribution of documents having a mutation identified by at least one of four mutation detection tools. Relevant: documents labelled as relevant for PPIm in the training set; otherwise non-relevant. Mutations found: documents with a mutation mention identified by at least one of the tools; otherwise no mutations found. The y-axis corresponds to the proportion relative to the relevant or non-relevant document collections, respectively. For instance, almost 70% of non-relevant documents (1628 out of 2353) have no detected mutation mentions. (b) Distribution of relevant documents having mutations identified by individual tools; y = 1729 is the total number of relevant documents, in each of which we would expect at least one mutation mention per the task definition.
Figure 2 Distribution of the probability of documents having interactions identified by PIE the search. The x-axis displays the probability score output by PIE the search (normalized to [0, 1]); a higher score indicates higher probability of having an interaction. The y-axis reflects the number of documents. (a) and (b) represent the distribution for non-relevant (blue) and relevant documents (orange), respectively.
Figure 3 Venn diagram showing the differences in the gene entity annotations over the PPIm dataset. An entity annotation is regarded as a triple (document id, character span, entity id). The three sets are the set of annotations given in the task dataset (Task) and those from the two entity annotators, namely Pubtator and GNormPlus (GNorm).
Distribution of entities as recognised by different entity annotators in the PPIm dataset
| Count | Annotator | Training set | Test set |
|---|---|---|---|
| Number of documents | - | 597 | 632 |
| Number of PPI relations | Task | 760 | 868 |
| Gene annotations | Task | 8832 | 1461 |
| | GNorm | 8677 | 11229 |
| | Pubtator | 8801 | 11059 |
| | Annotator ensemble | 10897 | 11665 |
| Mutations | Pubtator | 557 | 1722 |
| Species | Pubtator | 1102 | 948 |
Maximum recall achievable by our relation extraction system with different entity annotation schemes
| Entity annotation | Maximum recall achievable by relation extraction (%) |
|---|---|
| GNormPlus | 55.95 |
| GNormPlus + Pubtator | 56.30 |
| Task annotations | 100.00 |
| Annotator ensemble | 100.00 |
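The recall ceiling in the table above follows from a simple observation: a gold PPI pair can only be extracted if both of its gene IDs were recognised in the document. A minimal sketch of that bound, with made-up toy relations and annotations (not the paper's data):

```python
# Sketch (not the authors' code): the maximum recall a relation extractor
# can reach is bounded by its entity annotations -- a gold PPI pair is
# only reachable if BOTH gene IDs were recognised in the document.

def max_achievable_recall(gold_relations, recognised_entities):
    """gold_relations: dict doc_id -> set of frozenset({gene_a, gene_b});
    recognised_entities: dict doc_id -> set of gene ids found by NER."""
    total = reachable = 0
    for doc_id, pairs in gold_relations.items():
        found = recognised_entities.get(doc_id, set())
        for pair in pairs:
            total += 1
            if pair <= found:          # both genes recognised
                reachable += 1
    return reachable / total if total else 0.0

# Toy example: G3 is never recognised, so one of three pairs is unreachable.
gold = {"d1": {frozenset({"G1", "G2"}), frozenset({"G1", "G3"})},
        "d2": {frozenset({"G4", "G5"})}}
ner = {"d1": {"G1", "G2"}, "d2": {"G4", "G5"}}
print(max_achievable_recall(gold, ner))  # 2 of 3 pairs reachable
```

This is why the annotator ensemble matters: unioning GNormPlus and Pubtator annotations (plus task annotations) raises the set of reachable pairs before any relation classifier runs.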
Algorithm 1 The algorithm measures the 'impact' of a sentence. If a co-occurrence relationship is identified, using either intersection, union or complement as described in the counting-based category, it first calculates the mutation and interaction scores by summing the total mentions and normalising to [0, 1]. It then calculates the impact score as a weighted combination of the mutation and interaction scores (weights α and β). The default weights are 0.5, meaning that mutations and interactions are treated as equally important.
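A hedged reconstruction of Algorithm 1 as described above. The function name, the `max_mentions` normalisation cap, and the toy inputs are our assumptions for illustration, not the authors' exact code:

```python
# Sketch of Algorithm 1's impact score. The normalisation choice
# (dividing by an assumed maximum count, max_mentions) is our guess;
# the paper only states that counts are normalised to [0, 1].

def impact_score(mutation_mentions, interaction_mentions,
                 max_mentions=10, alpha=0.5, beta=0.5):
    """Score a sentence containing a mutation/interaction co-occurrence.

    mutation_mentions / interaction_mentions: mention counts found
    (e.g. by term lists, BioNLP tools, or their union/intersection).
    """
    if not (mutation_mentions and interaction_mentions):
        return 0.0                      # no co-occurrence, no impact
    # Normalise each count to [0, 1].
    m = min(mutation_mentions / max_mentions, 1.0)
    i = min(interaction_mentions / max_mentions, 1.0)
    # Weighted combination; alpha = beta = 0.5 treats both equally.
    return alpha * m + beta * i

print(impact_score(2, 4))  # 0.5 * 0.2 + 0.5 * 0.4 = 0.3
```

With the default α = β = 0.5, a sentence rich in both mutation and interaction mentions scores close to 1, while a sentence missing either kind of mention scores 0.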
Detailed descriptions of the best feature sets found via the feature contribution study in Table 3
| Feature set | Feature ID | Description | Aspects |
|---|---|---|---|
| S1 | F1 | Number of interactions identified in total across the paragraph by term lists | Paragraph-Term-Individual-Total |
| | F2 | Number of unique interactions identified across the paragraph by term lists | Paragraph-Term-Individual-Unique |
| | F3 | Number of mutations identified in total across the paragraph by term lists | Paragraph-Term-Individual-Total |
| | F4 | Number of unique mutations identified across the paragraph by term lists | Paragraph-Term-Individual-Unique |
| S2 | F5 | Number of interactions and mutations in total across the paragraph by term lists if co-occurrence exists | Paragraph-Term-Occurrence-Total |
| | F6 | Number of unique interactions and mutations across the paragraph by term lists if co-occurrence exists | Paragraph-Term-Occurrence-Unique |
| S3 | F7 | Number of genes identified in total across the paragraph by BioNLP tools | Paragraph-BioConcept-Individual-Total |
| | F8 | Number of unique genes identified across the paragraph by BioNLP tools | Paragraph-BioConcept-Individual-Unique |
| | F9 | Number of mutations identified in total across the paragraph by BioNLP tools | Paragraph-BioConcept-Individual-Total |
| | F10 | Number of unique mutations identified across the paragraph by BioNLP tools | Paragraph-BioConcept-Individual-Unique |
| | F11 | The probability of the paragraph containing interactions by BioNLP tools (using PIE the search) | Paragraph-BioConcept-Individual-Total (probability) |
| S4 | F12 | Number of interactions and mutations in total across the paragraph by BioNLP tools | Paragraph-BioConcept-Occurrence-Total |
| S5 | F13 | Number of sentences containing mutations by term lists | Sentence-Term-Individual-Total |
| | F14 | Number of sentences containing interactions by term lists | Sentence-Term-Individual-Total |
| S6 | F15 | Number of sentences containing both interactions and mutations by term lists | Sentence-Term-Occurrence-Total |
| S7 | F16 | Number of sentences containing mutations by BioNLP tools | Sentence-BioConcept-Individual-Total |
| | F17 | Number of sentences containing genes by BioNLP tools | Sentence-BioConcept-Individual-Total |
| S8 | F18 | Number of sentences containing both interactions and mutations by BioNLP tools | Sentence-BioConcept-Occurrence-Total |
| S9 | F19 | Number of sentences containing both interactions and mutations either by term lists or BioNLP tools | Sentence-Both-Occurrence-Union |
| | F20 | Number of sentences containing both interactions and mutations either by term lists or BioNLP tools using complementary approach | Sentence-Both-Occurrence-Complement |
| S10 | F21 | Number of sentences containing mutation-impact-interaction triplets by term lists | Sentence-Term-Triplet-Total |
| | F22 | Number of sentences containing mutation-impact-interaction triplets by term lists or BioNLP tools | Sentence-Both-Triplet-Union |
| | F23 | Number of sentences containing mutation-impact-interaction triplets by term lists or BioNLP tools or one another | Sentence-Both-Triplet-Complement |
| S11 | F24 | Same as F17, but on number of sentence 2-grams | Sentence-Term-Triplet-Total |
| | F25 | Same as F18, but on number of sentence 2-grams | Sentence-Both-Triplet-Union |
| | F26 | Same as F19, but on number of sentence 2-grams | Sentence-Both-Triplet-Complement |
| | F27 | Same as F22, but using average probability of sentences instead of number of sentences | Sentence-Both-Triplet-Complement (probability) |
Feature contribution study results. Each set of features in a row is added to the existing feature set; for example, the term paragraph individual features are the new features added to the baseline features. The description of each set is consistent with the description of feature aspects. 2-gram: every two sentences. Fold 1 is the first fold of 10-fold cross-validation, and likewise for the other folds. Table 4 provides the detailed descriptions of the best feature set found in the feature contribution study.
| | Fold 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | F1 mean (±std) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 0.6390 | 0.6418 | 0.6457 | 0.6118 | 0.6435 | 0.6577 | 0.6611 | 0.6461 | 0.6286 | 0.5714 | 0.6347 (±0.0262) |
| Term paragraph individual features (S1) | 0.6796 | 0.6523 | 0.7120 | 0.6538 | 0.6832 | 0.7143 | 0.6610 | 0.7119 | 0.6991 | 0.6386 | 0.6806 (±0.0281) |
| Term paragraph co-occurrence features (S2) | 0.6839 | 0.6541 | 0.7065 | 0.6474 | 0.6814 | 0.7128 | 0.6629 | 0.7139 | 0.7012 | 0.6488 | 0.6813 (±0.0266) |
| BioNLP paragraph individual features (S3) | 0.6753 | 0.6542 | 0.7215 | 0.6538 | 0.6868 | 0.7363 | 0.6685 | 0.7155 | 0.7012 | 0.6667 | 0.6880 (±0.0292) |
| BioNLP paragraph co-occurrence features (S4) | 0.6753 | 0.6584 | 0.7215 | 0.6538 | 0.6923 | 0.7363 | 0.6722 | 0.7135 | 0.6972 | 0.6667 | 0.6887 (±0.0281) |
| Term sentence individual feature (S5) | 0.6753 | 0.6522 | 0.7139 | 0.6431 | 0.6904 | 0.7415 | 0.6685 | 0.7155 | 0.7012 | 0.6588 | 0.6860 (±0.0318) |
| Term sentence co-occurrence feature (S6) | 0.7386 | 0.6847 | 0.7538 | 0.6688 | 0.6885 | 0.7558 | 0.7193 | 0.7348 | 0.7169 | 0.6586 | 0.7120 (±0.0349) |
| BioNLP sentence individual feature (S7) | 0.7386 | 0.6828 | 0.7570 | 0.6731 | 0.6885 | 0.7609 | 0.7158 | 0.7348 | 0.7169 | 0.6606 | 0.7129 (±0.0353) |
| BioNLP sentence co-occurrence feature (S8) | 0.7320 | 0.6826 | 0.7468 | 0.6688 | 0.6921 | 0.7520 | 0.7123 | 0.7313 | 0.7169 | 0.6727 | 0.7108 (±0.0303) |
| Term & BioNLP sentence feature (S9) | 0.7284 | 0.6903 | 0.7558 | 0.6730 | 0.6957 | 0.7572 | 0.7033 | 0.7308 | 0.7234 | 0.6607 | 0.7118 (±0.0328) |
| Term & BioNLP sentence triplet feature (S10) | 0.7273 | 0.6825 | 0.7538 | 0.6730 | 0.6940 | 0.7461 | 0.7139 | 0.7534 | 0.7077 | 0.6786 | 0.7130 (±0.0311) |
| Term & BioNLP sentence triplet 2-gram feature (S11) | 0.7273 | 0.6807 | 0.7661 | 0.6752 | 0.6959 | 0.7532 | 0.7174 | 0.7520 | 0.7099 | 0.6806 | 0.7158 (±0.0332) |
| Full feature set plus boosting | 0.7571 | 0.7086 | 0.7646 | 0.7108 | 0.7059 | 0.7809 | 0.7500 | 0.7627 | 0.7251 | 0.6899 | 0.7355 (±0.0311) |
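The protocol behind the table above can be sketched as a cumulative loop: start from the baseline features, add one candidate feature set at a time, and record the cross-validated F1 of the running union. A minimal stand-in, where `evaluate` is a hypothetical callback abstracting the real 10-fold cross-validation of the classifier:

```python
# Sketch of the cumulative feature-contribution protocol (our
# reconstruction, not the authors' code). evaluate(features) must
# return one F1 score per cross-validation fold.
import statistics

def contribution_study(feature_sets, evaluate):
    """feature_sets: list of (name, [feature, ...]) in addition order."""
    current, history = [], []
    for name, feats in feature_sets:
        current = current + feats          # grow the running feature union
        scores = evaluate(current)         # per-fold F1 for this union
        history.append((name, statistics.mean(scores),
                        statistics.stdev(scores)))
    return history

# Deterministic toy evaluate: F1 grows with the number of features.
fake = lambda feats: [0.60 + 0.01 * len(feats), 0.62 + 0.01 * len(feats)]
sets = [("baseline", ["f0"]), ("S1", ["f1", "f2"])]
for name, mean_f1, sd in contribution_study(sets, fake):
    print(name, round(mean_f1, 3))
```

Each row of the table corresponds to one iteration of this loop, which is why the feature sets S1 to S11 accumulate rather than being evaluated in isolation.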
Figure 4 Illustration of the (partial) dependency graph for the sentence 'This result indicates that LAF1 and HFR1 function in largely independent pathways'. The entities (genes) are shown in blue.
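Graph-based relation extraction of the kind used here (e.g. approximate subgraph matching) operates on such dependency graphs, typically via paths between the two entity tokens. A toy sketch with breadth-first search; the edges loosely follow the Figure 4 sentence and are illustrative, not a parser's actual output:

```python
# Toy dependency-graph traversal between two gene entities.
# Edges are our hand-written approximation of the Figure 4 parse.
from collections import deque

edges = [("indicates", "result"), ("indicates", "function"),
         ("function", "LAF1"), ("LAF1", "HFR1"),
         ("function", "pathways"), ("pathways", "independent")]

graph = {}
for head, dep in edges:          # undirected adjacency for path search
    graph.setdefault(head, set()).add(dep)
    graph.setdefault(dep, set()).add(head)

def shortest_path(graph, src, dst):
    """Breadth-first search returning one shortest token path."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "LAF1", "HFR1"))  # ['LAF1', 'HFR1']
```

The short LAF1-HFR1 path reflects why sentence-level candidate pairs connected through few dependency edges are the easiest cases for such systems.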
The candidate examples generated for relation classification with different entity annotation schemes
| Type | Entity annotator | Training set (positive) | Training set (negative) | Test set (positive) | Test set (negative) |
|---|---|---|---|---|---|
| Sentence-level relations | Task | 4115 | 532 | 264 | 152 |
| | GNormPlus | 1709 | 4697 | 2829 | 5780 |
| | GNormPlus + Pubtator | 1783 | 4822 | 2865 | 5835 |
| | GNormPlus + Pubtator + Task | 4058 | 4533 | 3097 | 6030 |
| Non-sentence-level relations | Task | 412 | 162 | 556 | 178 |
| | GNormPlus | 89 | 4616 | 321 | 10105 |
| | GNormPlus + Pubtator | 89 | 4734 | 338 | 10190 |
| | GNormPlus + Pubtator + Task | 430 | 5317 | 872 | 11719 |
Document triage task performance over the training set using 10-fold cross-validation
| Model | Training time (sec) | Prediction time (sec) | Ranked precision | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Baseline | 0.5885 (±0.0949) | 0.0006 (±0.0001) | 0.6839 (±0.0303) | 0.6485 (±0.0353) | 0.6229 (±0.0353) | 0.6347 (±0.0262) |
| LR (boosting) | 26.7562 (±4.0662) | 0.0500 (±0.0133) | – | 0.7058 (±0.0313) | 0.7684 (±0.0368) | – |
| SVM | 65.8464 (±0.8162) | 1.1460 (±0.0398) | 0.7479 (±0.0287) | – | 0.7102 (±0.0443) | 0.7095 (±0.0322) |
| RF | 829.0522 (±54.4509) | 8.4559 (±0.1722) | 0.7457 (±0.0370) | 0.6651 (±0.0322) | – | 0.7236 (±0.0227) |
Document triage task performance on the test set
| Model | Ranked precision | Precision | Recall | F1 |
|---|---|---|---|---|
| Baseline | 0.6329 | 0.5852 | 0.6733 | 0.6262 |
| LR (boosting) | – | 0.5783 | 0.7713 | 0.6610 |
| SVM | 0.6721 | – | 0.7116 | 0.6473 |
| RF | 0.6744 | 0.5361 | – | – |
Relation extraction performance with different modes of entity annotation. Performance measurements for the training set are based on 10-fold cross-validation. Best results for the training and test sets, for entity recognition with standard BioNLP tools, are highlighted.
| Entity annotation | Relation extraction method | Training set | | | Test set | | |
|---|---|---|---|---|---|---|---|
| | | Precision | Recall | F1 | Precision | Recall | F1 |
| Oracle | ASM with sentence-level relations only | 0.8472 | 0.8112 | 0.8288 | 0.7605 | 0.2923 | 0.4223 |
| | + Non-sentence relation extraction | 0.8373 | 0.8830 | 0.8595 | 0.7632 | 0.8826 | 0.8186 |
| | + Relevance info | 0.8308 | 0.8883 | 0.8586 | 0.7665 | 0.8803 | 0.8195 |
| | + Mutation terms | 0.8346 | 0.8790 | 0.8562 | 0.7614 | 0.8849 | 0.8185 |
| | Co-occurrence method (N ≥ 3) | 0.8755 | 0.6263 | 0.7302 | 1.000 | 1.000 | 1.000 |
| GNormPlus | ASM with sentence-level relations only | 0.2997 | 0.2766 | 0.2877 | 0.3208 | – | 0.3409 |
| | + Non-sentence relation extraction | 0.2987 | 0.2793 | 0.2887 | 0.3189 | 0.3636 | 0.3398 |
| | + Relevance info | – | 0.2753 | 0.2877 | 0.3277 | 0.3544 | 0.3405 |
| | + Mutation terms | 0.2965 | 0.2819 | – | 0.3277 | 0.3579 | 0.3421 |
| | Co-occurrence method (N ≥ 3) | 0.1583 | – | 0.2688 | – | 0.3098 | 0.3492 |
| GNormPlus + Pubtator | ASM with sentence-level relations only | 0.2832 | 0.2832 | 0.2832 | 0.3464 | 0.3556 | – |
| | + Non-sentence relation extraction | 0.2821 | 0.2832 | 0.2827 | 0.3384 | 0.3567 | 0.3473 |
| | + Relevance info | 0.2808 | – | 0.2827 | 0.3467 | 0.3383 | 0.3425 |
| | + Mutation terms | – | 0.2819 | – | 0.3312 | 0.3544 | 0.3424 |
| | Co-occurrence method (N ≥ 3) | - | - | - | - | - | - |
| GNormPlus + Pubtator + Oracle | ASM with sentence-level relations only | 0.4066 | 0.7699 | 0.5322 | 0.3147 | 0.6030 | 0.4136 |
| | + Non-sentence relation extraction | 0.4066 | 0.7726 | 0.5328 | 0.3118 | 0.6053 | 0.4116 |
| | + Relevance info | 0.4068 | 0.7686 | 0.5320 | 0.3107 | 0.6064 | 0.4109 |
| | + Mutation terms | 0.4052 | 0.7699 | 0.5309 | 0.3147 | 0.6018 | 0.4133 |
| | Co-occurrence method (N ≥ 3) | 0.3061 | 0.2487 | 0.2744 | 0.6623 | 0.9954 | 0.7954 |
| | Co-occurrence method (N ≥ 9) | 0.3961 | 0.1090 | 0.1710 | 0.9183 | 0.9954 | 0.9553 |
Figure 5 Impact of the choice of threshold N (the number of sentences containing a given protein pair) on the performance of the heuristic co-occurrence approach on the test set, for (a) automated protein/gene named entity recognition and (b) oracle protein named entity recognition scenarios.
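The heuristic co-occurrence baseline varied by Figure 5 can be sketched in a few lines: predict a PPIm relation for a protein pair whenever at least N sentences in the document mention both proteins. Sentence splitting and entity spotting are simplified here to pre-computed sets of toy gene IDs:

```python
# Minimal sketch of the threshold-N co-occurrence heuristic
# (our reconstruction; input representation is a toy stand-in).
from itertools import combinations

def cooccurrence_relations(sentence_entities, n=3):
    """sentence_entities: one set of gene ids per sentence of a document.
    Returns the protein pairs co-mentioned in at least n sentences."""
    counts = {}
    for ents in sentence_entities:
        for pair in combinations(sorted(ents), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, c in counts.items() if c >= n}

# Toy document: G1 and G2 co-occur in three sentences, G2 and G3 in two.
doc = [{"G1", "G2"}, {"G1", "G2", "G3"}, {"G1", "G2"}, {"G2", "G3"}]
print(cooccurrence_relations(doc, n=3))  # {('G1', 'G2')}
```

Raising N trades recall for precision, which is exactly the trend Figure 5 plots: with oracle entities a high N (e.g. N ≥ 9) keeps almost only true pairs, while with automated NER the same thresholds behave far less cleanly.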
Quantitative analysis of training and test sets in terms of the number of entities (mutations or interactions) identified by BioNLP tools per document. Min, Q1 (25th percentile), Median, Q3 (75th percentile) and Max give the distributional characteristics; Mean and Std show the characteristics on average. The numbers marked with * indicate a p-value below 0.00001 in a two-sample z-test.
| Entity | Class | Train vs test | Min | Q1 | Median | Q3 | Max | Mean (Std) |
|---|---|---|---|---|---|---|---|---|
| Mutation | Relevant | Train (1729 documents) | 0 | 0 | 1 | 2 | 29 | – |
| | | Test (704 documents) | 0 | 1 | 2 | 4 | 24 | – |
| | Non-relevant | Train (2353 documents) | 0 | 0 | 0 | 1 | 17 | – |
| | | Test (723 documents) | 0 | 1 | 1 | 3 | 20 | – |
| Interaction | Relevant | Train | 0 | 7 | 15 | 22 | 66 | 15.6217 (10.9309) |
| | | Test | 0 | 9 | 16 | 24 | 71 | 16.7528 (10.8994) |
| | Non-relevant | Train | 0 | 7 | 14 | 21 | 64 | 14.3366 (9.9784) |
| | | Test | 0 | 8 | 15 | 23 | 55 | 15.7718 (11.0419) |
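The caption's significance test can be sketched with the standard two-sample z statistic for the difference of independent sample means. The inputs below are toy numbers of the same rough magnitude as the table, not the paper's actual values:

```python
# Two-sample z-test sketch (stdlib only). A |z| above ~4.42 corresponds
# to a two-sided p-value below 0.00001, the caption's threshold.
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference of two independent sample means."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se

# Toy comparison of mutation counts per document, train vs test.
z = two_sample_z(1.9, 2.8, 1729, 2.9, 3.3, 704)
print(round(z, 2))  # -7.07
```

With sample sizes in the hundreds to thousands, even a difference of one mutation mention per document yields a |z| far past the p < 0.00001 threshold, which is the statistical basis for the train/test mismatch claim.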
Figure 6 The proportion of relevant documents containing individual mutation terms, across training and testing data sets. The terms (or the groups of terms) are mentioned in the model development section.
Figure 7 The proportion of non-relevant documents containing individual mutation terms, across training and testing data sets. The terms (or the groups of terms) are mentioned in the model development section.
Figure 8 Venn diagrams of the number of documents identified as having mutations using a combination of BioNLP tools and term lists. The left side shows the results on the training set: (a) relevant documents and (b) non-relevant documents; the right side, (c) and (d), shows the results for relevant and non-relevant documents, respectively, in the testing set.
Characteristics of erroneous and correct cases classified by boosting logistic regression over the training set. Perspective: B for BioNLP tools and T for term lists. The numbers are the number of mutations identified using B or T per document
| Category | # cases (%) | Perspective | Min | Q1 | Median | Q3 | Max | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|
| FP | 552 (13.52%) | B | 0 | 0 | 1 | 2 | 15 | 1.6069 | 2.1651 |
| | | T | 0 | 1 | 2 | 4 | 16 | 2.8043 | 2.2291 |
| FN | 399 (9.77%) | B | 0 | 0 | 0 | 1 | 19 | 0.8647 | 1.8803 |
| | | T | 0 | 0 | 1 | 2 | 13 | 1.3559 | 1.6356 |
| TP | 1330 (32.58%) | B | 0 | 0 | 1 | 2 | 29 | 1.9173 | 2.8280 |
| | | T | 0 | 0 | 0 | 1 | 12 | 0.6863 | 1.3548 |
| TN | 1801 (44.13%) | B | 0 | 0 | 0 | 0 | 17 | 0.4281 | 1.2874 |
| | | T | 0 | 0 | 0 | 1 | 12 | 0.6863 | 1.3548 |
Characteristics of erroneous and correct cases classified by boosting logistic regression over the test set. The measures are the same as in Table 10.
| Category | # cases (%) | Perspective | Min | Q1 | Median | Q3 | Max | Mean | Std |
|---|---|---|---|---|---|---|---|---|---|
| FP | 396 (27.75%) | B | 0 | 1 | 2 | 4 | 17 | 2.6869 | 2.5051 |
| | | T | 0 | 2 | 3 | 5 | 14 | 3.9444 | 2.9253 |
| FN | 161 (11.28%) | B | 0 | 0 | 1 | 2 | 13 | 1.5962 | 2.1041 |
| | | T | 0 | 1 | 1 | 2 | 10 | 1.8758 | 2.0574 |
| TP | 543 (38.05%) | B | 0 | 1 | 2 | 4 | 24 | 2.9650 | 3.2801 |
| | | T | 0 | 2 | 3 | 6 | 17 | 4.0166 | 2.8669 |
| TN | 327 (22.92%) | B | 0 | 1 | 1 | 2 | 20 | 1.7248 | 2.1139 |
| | | T | 0 | 0 | 1 | 2 | 19 | 1.6575 | 2.0836 |
Figure 9 Comparative F1 scores for boosting logistic regression over the training and testing sets. The legend shows the feature sets used; for each, the corresponding columns compare performance on the training and testing sets.