Jinchan Qu, Albert Steppi, Dongrui Zhong, Jie Hao, Jian Wang, Pei-Yau Lung, Tingting Zhao, Zhe He, Jinfeng Zhang.
Abstract
BACKGROUND: Information on protein-protein interactions affected by mutations is useful for understanding the biological effects of mutations and for developing treatments that target those interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from the literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation.
Keywords: Biomedical literature retrieval; Mutations; Protein interactions affected by mutations; Protein-protein interactions; Text mining
Year: 2020 PMID: 33167858 PMCID: PMC7654050 DOI: 10.1186/s12864-020-07185-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Results of the document triage task of the BioCreative VI precision medicine track. Our team ID is 433. Among the methods that performed better than the baseline model, our method achieved the highest recall and a comparable F1 score.
| Team | Precision | Recall | F1 | AvPr |
|---|---|---|---|---|
| 421 | 0.6073 | 0.7997 | 0.6904 | 0.7253 |
| 418 | 0.6289 | 0.7656 | 0.6906 | 0.7185 |
| 374 | 0.6070 | 0.7898 | 0.6864 | 0.6929 |
| 375 | 0.5783 | 0.7713 | 0.6610 | 0.6822 |
| 433 | 0.5413 | 0.8835 | 0.6713 | 0.6632 |
| Baseline | 0.6122 | 0.6435 | 0.6274 | 0.6515 |
| 420 | 0.5438 | 0.8736 | 0.6703 | 0.6439 |
| 419 | 0.5992 | 0.6222 | 0.6105 | 0.6334 |
| 405 | 0.5484 | 0.5710 | 0.5595 | 0.5871 |
| 414 | 0.5022 | 0.9801 | 0.6641 | 0.5008 |
| 379 | 0.4649 | 0.3480 | 0.3981 | 0.4904 |
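As a quick sanity check on the table above, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below is illustrative only: the function name `f1_score` is our own, and because the inputs are the rounded values from the table, the results should agree with the reported F1 column only to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "433": (0.5413, 0.8835),
    "Baseline": (0.6122, 0.6435),
}

for team, (p, r) in rows.items():
    # Recomputed from rounded inputs, so the last digit may differ
    # slightly from the F1 column in the table.
    print(team, round(f1_score(p, r), 4))
```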
Post-competition improvement of our method. +w2v is the revised method with word2vec embeddings
| Model | Validation | Precision | Recall | F1 | AvPr |
|---|---|---|---|---|---|
| Original | 10f CV (Train) | 0.6253 | 0.8208 | 0.7098 | 0.7148 |
| Original | Test | 0.5823 | 0.8096 | 0.6774 | 0.6785 |
| +w2v | 10f CV (Train) | 0.6264 | 0.8150 | 0.7084 | 0.7138 |
| +w2v | Test | 0.5651 | 0.8509 | 0.6791 | 0.6962 |
Fig. 1 Plot of precision, recall and F1 score versus cutoff value
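A curve of this kind can be produced by sweeping a decision cutoff over classifier scores and recomputing the three metrics at each value. The following is a minimal sketch under stated assumptions: the scores and labels are made-up toy data, the helper name `metrics_at_cutoff` is hypothetical, and treating a score greater than or equal to the cutoff as a positive prediction is our convention, not necessarily the one used for the figure.

```python
def metrics_at_cutoff(scores, labels, cutoff):
    """Return (precision, recall, F1) when scores >= cutoff count as positive."""
    predicted = [s >= cutoff for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): classifier scores and gold relevance labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]

# Sweep a few cutoffs; lowering the cutoff raises recall on this toy data.
for cutoff in (0.7, 0.5, 0.3):
    p, r, f1 = metrics_at_cutoff(scores, labels, cutoff)
    print(f"cutoff={cutoff:.1f} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Plotting the three returned values against the cutoff gives the trade-off view that the caption describes.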
Baseline model performance on the BioCreative VI precision medicine track corpus
| Data | Precision | Recall | F1 | F1 all relevant | AvPr |
|---|---|---|---|---|---|
| 10f CV (IntAct) | 0.7184 | 0.6321 | 0.6725 | 0.5507 | 0.7577 |
| Validation (TM) | 0.6210 | 0.6897 | 0.6536 | 0.6842 | 0.6551 |
| 10f CV (all data) | 0.6891 | 0.6260 | 0.6561 | 0.5915 | 0.7225 |
AvPr: Average precision; 10f CV: 10-fold Cross-validation; TM: Text Mining set, corpus of abstracts found with the aid of text mining methods
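For readers unfamiliar with AvPr, one common definition of average precision is the mean of the precision values taken at each rank where a relevant document appears in the ranked output. The sketch below implements that definition on made-up toy data; the function name `average_precision` is our own, and the exact formula used by the BioCreative VI evaluation may differ, so this should be read as an illustration rather than the official scorer.

```python
def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a relevant item.

    `ranked_labels` lists relevance (truthy = relevant) in ranked order,
    best-scored document first. Returns 0.0 if nothing is relevant.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking (hypothetical): relevant documents at ranks 1, 3 and 4.
print(round(average_precision([1, 0, 1, 1, 0]), 4))
```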
Fig. 2 Illustration of our method. In the “Feature Engineering” boxes, the major tools/algorithms used in each step are mentioned in the parentheses
Fig. 3 An example of dependency parsing. The labeled arcs describe the dependency between two words
Fig. 4 The shortest path between PROT1 and PROT2 in the example shown in Fig. 3 (they are shown as “Protein A” and “Protein B” in Fig. 3)
Fig. 5 The shortest path between PROT1 and variant in the example shown in Fig. 3
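The shortest paths described in Figs. 4 and 5 can be found by treating the dependency parse as an undirected graph and running a breadth-first search between the two entity tokens. The sketch below assumes the parse is already available as (head, dependent) pairs; the edge list is a made-up toy sentence, not the example from Fig. 3, and `shortest_dependency_path` is a hypothetical helper rather than the authors' implementation.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over an undirected view of the dependency edges.

    `edges` is an iterable of (head, dependent) token pairs.
    Returns the list of tokens from source to target, or None if unreachable.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None

    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


# Toy parse (hypothetical) of "PROT1 binds PROT2 when variant is present".
edges = [
    ("binds", "PROT1"),
    ("binds", "PROT2"),
    ("binds", "present"),
    ("present", "variant"),
    ("present", "when"),
    ("present", "is"),
]

print(shortest_dependency_path(edges, "PROT1", "PROT2"))
print(shortest_dependency_path(edges, "PROT1", "variant"))
```

Ignoring edge direction is a deliberate simplification here: it lets the path climb from one entity to a shared governing word and descend to the other, which is what a path between two arguments of the same verb requires.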