Ferhat Aydın, Zehra Melce Hüsünbeyi, Arzucan Özgür.
Abstract
Information regarding the physical interactions among proteins is crucial, since protein-protein interactions (PPIs) are central to many biological processes. The experimental techniques used to verify PPIs are vital for characterizing and assessing the reliability of the identified PPIs. Much of the information about PPIs and the experimental methods used to detect them is available only in the text of the scientific publications that report them. In this study, we approach the problem of identifying passages that describe experimental methods for physical interactions between proteins as an information retrieval task. The baseline system is based on query matching, where the queries are generated from the names (including synonyms) of the experimental methods in the Proteomics Standards Initiative-Molecular Interactions (PSI-MI) ontology. We propose two methods in which the baseline queries are expanded with additional relevant terms. The first method is a supervised approach, where the most salient terms for each experimental method are obtained by using the term frequency-relevance frequency (tf.rf) metric over 13 articles from our manually annotated data set of 30 full text articles, which is made publicly available. The second method is an unsupervised approach, where the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The word embeddings are obtained from a large unlabeled full text corpus. The proposed methods are evaluated on a test set consisting of the remaining 17 articles. Both methods obtain higher recall than the baseline, at the cost of lower precision. Besides higher recall, the word embeddings based approach achieves a higher F-measure than the baseline and the tf.rf based methods.
We also show that incorporating gene name and interaction keyword identification leads to improved precision and F-measure scores for all three evaluated methods. The tf.rf based approach was developed as part of our participation in the Collaborative Biocurator Assistant Task of the BioCreative V challenge assessment, whereas the word embeddings based approach is a novel contribution of this article.
Database URL: https://github.com/ferhtaydn/biocemid/
Year: 2017 PMID: 28077568 PMCID: PMC5225401 DOI: 10.1093/database/baw166
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Sample text with multiple PPIs and experimental methods, taken from the Results section of (30). The text describes three experimental interaction detection methods used to identify the proteins interacting with the ‘TBK1’ protein. The passages describing the experimental interaction detection methods ‘tandem affinity purification’ (MI:0676), ‘mass spectrometry studies of complexes’ (MI:0069), and ‘coimmunoprecipitation’ (MI:0019) are highlighted in yellow, purple and green, respectively.
Figure 2. A sample annotation from a paragraph of an article in the data set. Each annotation has an identifier that is incremented by one throughout the article. The value of the ‘type’ infon is static and set to ‘ExperimentalMethod’ for all annotations. The value of the ‘PSIMI’ infon is set to the PSI-MI identifier of the interaction detection method. The ‘text’ tag holds the annotated sentence(s). The ‘location’ tag holds the position of the annotated portion in the article with the ‘offset’ and ‘length’ attributes.
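The annotation layout described in the Figure 2 caption can be sketched with Python's standard XML library. This is a minimal illustration, not the authors' code: the `key` attribute on `infon` follows the usual BioC convention, while the example sentence, the helper name `build_annotation`, the use of character offsets, and the exact form of the PSIMI value (`MI:0676` rather than a bare number) are assumptions.

```python
import xml.etree.ElementTree as ET


def build_annotation(ann_id, psimi_id, passage, offset):
    """Build one BioC-style annotation element for an experimental method passage."""
    ann = ET.Element("annotation", id=str(ann_id))
    # The 'type' infon is the same for every annotation in the data set.
    ET.SubElement(ann, "infon", key="type").text = "ExperimentalMethod"
    # The 'PSIMI' infon carries the PSI-MI identifier of the detection method.
    ET.SubElement(ann, "infon", key="PSIMI").text = psimi_id
    # 'location' records where the passage sits in the article text
    # (character offsets are assumed here).
    ET.SubElement(ann, "location", offset=str(offset), length=str(len(passage)))
    ET.SubElement(ann, "text").text = passage
    return ann


# Made-up article text, used only to show how offset and length are derived.
article = "Background sentence. TBK1 was isolated by tandem affinity purification."
passage = "TBK1 was isolated by tandem affinity purification."
annotation = build_annotation(1, "MI:0676", passage, article.index(passage))
print(ET.tostring(annotation, encoding="unicode"))
```

Storing `offset` and `length` alongside the text lets a reader of the file recover the passage by slicing the article, which is also a cheap consistency check on the annotation.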
List of experimental interaction detection methods which are annotated in at least one article in the manually annotated data set
| Id | Name | Articles | Passages |
|---|---|---|---|
| MI:0004 | affinity chromatography technology | 3 | 5 |
| MI:0006 | anti bait coimmunoprecipitation | 8 | 23 |
| MI:0007 | anti tag coimmunoprecipitation | 7 | 14 |
| MI:0014 | adenylate cyclase complementation | 2 | 2 |
| MI:0017 | classical fluorescence spectroscopy | 1 | 4 |
| MI:0018 | two hybrid | 10 | 54 |
| MI:0019 | coimmunoprecipitation | 14 | 49 |
| MI:0029 | cosedimentation through density gradient | 1 | 1 |
| MI:0030 | cross-linking study | 2 | 5 |
| MI:0040 | electron microscopy | 1 | 4 |
| MI:0053 | fluorescence polarization spectroscopy | 1 | 1 |
| MI:0054 | fluorescence-activated cell sorting | 3 | 4 |
| MI:0055 | fluorescent resonance energy transfer | 3 | 10 |
| MI:0065 | isothermal titration calorimetry | 3 | 8 |
| MI:0071 | molecular sieving | 4 | 9 |
| MI:0077 | nuclear magnetic resonance | 4 | 27 |
| MI:0081 | peptide array | 1 | 4 |
| MI:0096 | pull down | 15 | 43 |
| MI:0104 | static light scattering | 1 | 1 |
| MI:0107 | surface plasmon resonance | 2 | 3 |
| MI:0114 | x-ray crystallography | 5 | 21 |
| MI:0276 | blue native page | 1 | 2 |
| MI:0402 | chromatin immunoprecipitation assay | 5 | 24 |
| MI:0411 | enzyme linked immunosorbent assay | 2 | 3 |
| MI:0412 | electrophoretic mobility supershift assay | 1 | 2 |
| MI:0413 | electrophoretic mobility shift assay | 1 | 6 |
| MI:0416 | fluorescence microscopy | 5 | 15 |
| MI:0419 | gtpase assay | 2 | 4 |
| MI:0423 | in-gel kinase assay | 1 | 1 |
| MI:0426 | light microscopy | 1 | 1 |
| MI:0663 | confocal microscopy | 3 | 6 |
| MI:0676 | tandem affinity purification | 1 | 4 |
| MI:0809 | bimolecular fluorescence complementation | 1 | 8 |
| MI:0858 | immunodepleted coimmunoprecipitation | 1 | 1 |
| MI:0889 | acetylase assay | 1 | 1 |
Figure 3. Overall system workflow.
The initial queries for the ‘affinity chromatography technology’ (MI:0004), ‘two hybrid’ (MI:0018) and ‘pull down’ (MI:0096) experimental methods
| MI:0004 | MI:0018 | MI:0096 |
|---|---|---|
| affinity chromatography technology | two hybrid | pull down |
| affinity chrom | two-hybrid | |
| affinity purification | yeast two hybrid | |
| | 2 hybrid | |
| | 2-hybrid | |
| | y2h | |
| | classical two hybrid | |
| | gal4 transcription regeneration | |
| | 2h | |
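The baseline query matching over these names and synonyms can be sketched as follows. This is an illustrative approximation rather than the system's actual matcher: the function name `match_methods`, the lowercasing, and the word-boundary rule are assumptions, and the example sentences are made up.

```python
import re

# Initial queries: PSI-MI names and synonyms, copied from the table above.
QUERIES = {
    "MI:0004": ["affinity chromatography technology", "affinity chrom",
                "affinity purification"],
    "MI:0018": ["two hybrid", "two-hybrid", "yeast two hybrid", "2 hybrid",
                "2-hybrid", "y2h", "classical two hybrid",
                "gal4 transcription regeneration", "2h"],
    "MI:0096": ["pull down"],
}


def match_methods(sentence, queries=QUERIES):
    """Return the sorted method IDs whose name or synonym occurs in the sentence."""
    text = sentence.lower()
    hits = set()
    for method_id, terms in queries.items():
        for term in terms:
            # Boundaries keep short synonyms such as '2h' from matching
            # inside longer tokens (assumed behaviour).
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text):
                hits.add(method_id)
                break
    return sorted(hits)


print(match_methods("The interaction was confirmed by a yeast two-hybrid screen."))
print(match_methods("GST pull down assays were performed after affinity purification."))
```

A sentence can match more than one method, which is why the function returns a list of identifiers instead of a single one.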
Expanded query for the ‘pull down’ experimental method
| Names | Tier 1 Terms | Tier 2 Terms |
|---|---|---|
| pull down | pull-down | flag-tagged |
| | down | pull |
| | pulled | pulled-down |
| | gst | gst-fusion |
| | his-tagged | glutathione |
| | s-transferase | glutathione-sepharose |
| | affinity | |
The names are extracted from the PSI-MI ontology. The Tier 1 and Tier 2 terms are extracted manually based on tf.rf weights.
Expanded query for the ‘pull down’ experimental method
| Names | Tier 1 Terms | Tier 2 Terms |
|---|---|---|
| pull down | pull-down | binding |
| | gst | gst-hnrnp-k |
| | rab5 | recombinant |
| | appl1 | interaction |
| | down | his-tagged |
| | proteins | protein |
| | melk | mutations |
The names are extracted from the PSI-MI ontology. The Tier 1 and Tier 2 terms are constructed automatically from the first 7 and second 7 top terms of tf.rf weights.
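The automatic tier construction can be sketched as ranking terms by tf.rf weight and slicing the ranking. The relevance frequency used below is the standard definition, rf = log2(2 + a / max(1, c)), where a and c count the positive and negative documents containing the term. How the system tokenizes text, aggregates term frequency, and breaks ties is not stated here, so those choices, the function names, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_rf_scores(positive_docs, negative_docs):
    """Score each term of the positive class with tf * log2(2 + a / max(1, c))."""
    # tf: total occurrences in the positive documents (assumed aggregation).
    tf = Counter(tok for doc in positive_docs for tok in doc)
    # a / c: number of positive / negative documents that contain the term.
    a = Counter(tok for doc in positive_docs for tok in set(doc))
    c = Counter(tok for doc in negative_docs for tok in set(doc))
    return {t: tf[t] * math.log2(2 + a[t] / max(1, c[t])) for t in tf}


def split_into_tiers(scores, tier_size):
    """Tier 1 = top `tier_size` terms, Tier 2 = the next `tier_size` terms."""
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:tier_size], ranked[tier_size:2 * tier_size]


# Toy tokenized passages: positives describe the method, negatives do not.
positive_docs = [["gst", "pull-down", "assay"], ["gst", "beads", "pull-down"]]
negative_docs = [["yeast", "assay"], ["assay", "plasmid"]]

scores = tf_rf_scores(positive_docs, negative_docs)
tier1, tier2 = split_into_tiers(scores, tier_size=2)
print(tier1, tier2)
```

With `tier_size=7` the same slicing would give the "first 7 and second 7" split; terms that also occur often in negative documents, such as `assay` in the toy data, are pushed down the ranking.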
Expanded query for the ‘pull down’ experimental method
| Names | Tier 1 Terms | Tier 2 Terms |
|---|---|---|
| pull down | pull-down | interaction |
| | gst | his-tagged |
| | rab5 | protein |
| | appl1 | mutations |
| | down | pull |
| | proteins | used |
| | melk | gtp |
| | binding | assay |
| | gst-hnrnp-k | fusion |
| | recombinant | figure |
The names are extracted from the PSI-MI ontology. The Tier 1 and Tier 2 terms are constructed automatically from the first 10 and second 10 top terms of tf.rf weights.
Figure 4. The algorithm for combining the word2vec results of each experimental method into one list.
Figure 5. The algorithm for cleaning the given list of an experimental method.
Figure 6. The algorithm for filtering the longer terms with lower scores from the given list of an experimental method.
Figure 7. The query expansion algorithm of the word2vec approach.
The expanded query terms of ‘two hybrid’ (MI:0018) are shown in bold
| Terms | Scores |
|---|---|
The italic terms are eliminated from the word2vec results in the cleaning operation explained in Figure 5. The terms with a score of 1.0 are the initial query items (names or synonyms). Terms that already contain any name or synonym (after splitting on spaces) are also eliminated and are therefore italicized.
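The combine-and-clean steps described above can be sketched as follows. This is one reading of the captions and the table note, not the published algorithm: the neighbour lists stand in for the most-similar-term output of a word2vec model, all scores are made up, and the function names and the rule "keep the highest score per term" are assumptions.

```python
def combine(neighbor_lists, names):
    """Merge per-name neighbour lists, keeping the best score for each term."""
    combined = {name: 1.0 for name in names}  # initial query items score 1.0
    for neighbors in neighbor_lists:
        for term, score in neighbors:
            if term not in names:
                combined[term] = max(score, combined.get(term, 0.0))
    return combined


def clean(combined, names):
    """Drop candidates that contain a name or synonym as a space-separated part."""
    kept = {}
    for term, score in combined.items():
        if term in names:
            kept[term] = score
        elif not any(part in names for part in term.split(" ")):
            kept[term] = score
    return kept


def expand(neighbor_lists, names, top_k):
    """Return the initial query items followed by the top_k surviving candidates."""
    cleaned = clean(combine(neighbor_lists, names), names)
    candidates = sorted((t for t in cleaned if t not in names),
                        key=lambda t: (-cleaned[t], t))
    return list(names) + candidates[:top_k]


# Hypothetical neighbour lists for two of the 'two hybrid' query items.
names = ["two hybrid", "two-hybrid", "y2h"]
neighbor_lists = [
    [("yeast two-hybrid", 0.91), ("bait", 0.74), ("prey", 0.70)],
    [("bait", 0.80), ("two-hybrid screen", 0.77), ("library", 0.55)],
]
print(expand(neighbor_lists, names, top_k=2))
```

Candidates such as `yeast two-hybrid` are removed because a query on `two-hybrid` already matches them, so they would add length to the query without adding coverage.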
Figure 8. An example paragraph that shows our evaluation logic over three sample manual and system annotations. The manually annotated passages are underlined in red and green for the ‘bimolecular fluorescence complementation’ (MI:0809) and ‘two hybrid’ (MI:0018) experimental methods, respectively. The passages annotated by the system are colored in blue and purple for the ‘bimolecular fluorescence complementation’ (MI:0809) and ‘two hybrid’ (MI:0018) experimental methods, respectively.
Performances of the methods on the test set
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| baseline | 0.424 | 0.418 | 0.421 |
| baseline.genia.ino | 0.484 | 0.413 | 0.446 |
| tf.rf.f7s7 | 0.120 | 0.508 | 0.194 |
| tf.rf.f7s7.genia.ino | 0.133 | 0.502 | 0.211 |
| tf.rf.f10s10 | 0.068 | 0.512 | 0.119 |
| tf.rf.f10s10.genia.ino | 0.074 | 0.507 | 0.129 |
| tf.rf.manual | 0.315 | 0.508 | 0.389 |
| tf.rf.manual.genia.ino | 0.357 | 0.503 | 0.418 |
| word2vec | 0.321 | 0.618 | 0.422 |
| word2vec.genia.ino | 0.362 | 0.606 | 0.453 |
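As a quick consistency check, the balanced F-measure can be recomputed from the precision and recall columns, taking the last column of the table as the F-measure. The helper name `f_measure` and the selection of rows are choices made for this sketch. Because the reported precision and recall are already rounded to three decimals, the recomputed value can differ from the reported one in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table above.
rows = {
    "baseline": (0.424, 0.418, 0.421),
    "tf.rf.manual": (0.315, 0.508, 0.389),
    "word2vec": (0.321, 0.618, 0.422),
    "word2vec.genia.ino": (0.362, 0.606, 0.453),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f_measure(p, r):.3f}, reported {reported:.3f}")
```

The harmonic mean explains why the tf.rf variants with very low precision score poorly overall despite recall above 0.5: the measure is dominated by the smaller of the two values.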
Additional results on the test set
| Method | Precision | Recall | F-measure |
|---|---|---|---|
| word2vec.genia.ino | 0.362 | 0.606 | 0.453 |
| word2vec.genia.ino.coip | 0.390 | 0.645 | 0.486 |
| word2vec.genia.ino.psimi | 0.439 | 0.751 | 0.554 |
The system evaluation when the ‘anti bait coimmunoprecipitation’ and ‘anti tag coimmunoprecipitation’ methods are regarded as ‘coimmunoprecipitation’ is shown with ‘word2vec.genia.ino.coip’. The system evaluation without the requirement of experimental method ID matching is shown with ‘word2vec.genia.ino.psimi’.
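The two relaxed evaluation settings can be sketched as a normalization of method identifiers before gold and system annotations are compared. The PSI-MI identifiers come from the annotated-methods table (MI:0006, MI:0007, MI:0019). Everything else is an assumption made for illustration: passages are represented by hypothetical integer IDs instead of text spans, a match requires the same passage ID, and the function and mode names are invented here.

```python
# 'anti bait' and 'anti tag' coimmunoprecipitation map to 'coimmunoprecipitation'.
COIP_VARIANTS = {"MI:0006": "MI:0019", "MI:0007": "MI:0019"}


def normalize(method_id, mode):
    """Map a method ID according to the evaluation setting."""
    if mode == "coip":
        return COIP_VARIANTS.get(method_id, method_id)
    if mode == "psimi":
        return "ANY"  # method identity is ignored entirely
    return method_id  # strict: IDs must match exactly


def evaluate(gold, predicted, mode="strict"):
    """gold / predicted: sets of (passage_id, method_id). Returns (P, R, F)."""
    g = {(pid, normalize(mid, mode)) for pid, mid in gold}
    s = {(pid, normalize(mid, mode)) for pid, mid in predicted}
    tp = len(g & s)
    precision = tp / len(s) if s else 0.0
    recall = tp / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f


# Toy annotations, not taken from the data set.
gold = {(1, "MI:0006"), (2, "MI:0018"), (3, "MI:0096")}
predicted = {(1, "MI:0019"), (2, "MI:0018"), (3, "MI:0004"), (4, "MI:0096")}

for mode in ("strict", "coip", "psimi"):
    print(mode, evaluate(gold, predicted, mode))
```

In the toy data, passage 1 counts as correct only once the coimmunoprecipitation variants are merged, and passage 3 only once the method ID is ignored, mirroring the stepwise gains in the table.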