Somnath Tagore1, Alessandro Gorohovski1, Lars Juhl Jensen2, Milana Frenkel-Morgenstern1.
Abstract
Tailored therapy aims to cure cancer patients effectively and safely, based on the complex interactions between patients' genomic features, disease pathology and drug metabolism. Thus, the continual increase in scientific literature drives the need for efficient methods of data mining to improve the extraction of useful information from texts based on patients' genomic features. An important application of text mining to tailored therapy in cancer encompasses the use of mutations and cancer fusion genes as moieties that change patients' cellular networks to develop cancer, and also affect drug metabolism. Fusion proteins, which are derived from the slippage of two parental genes, are produced in cancer by chromosomal aberrations and trans-splicing. Given that the two parental proteins for predicted fusion proteins are known, we used our previously developed method for identifying chimeric protein-protein interactions (ChiPPIs) associated with the fusion proteins. Here, we present a validation approach that receives fusion proteins of interest, predicts their cellular network alterations by ChiPPI and validates them by our new method, ProtFus, using an online literature search. This process resulted in a set of 358 fusion proteins and their corresponding protein interactions, as a training set for a Naïve Bayes classifier, to identify predicted fusion proteins that have reliable evidence in the literature and that were confirmed experimentally. Next, for a test group of 1817 fusion proteins, we were able to identify from the literature 2908 PPIs in total, across 18 cancer types. The described method, ProtFus, can be used for screening the literature to identify unique cases of fusion proteins and their PPIs, as a means of studying alterations of protein networks in cancers. Availability: http://protfus.md.biu.ac.il/.
Year: 2019 PMID: 31437145 PMCID: PMC6705771 DOI: 10.1371/journal.pcbi.1007239
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 2. N-gram model for detecting N-words by ProtFus.
The N-gram model and some possible sets of combinations.
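To make the idea of N-word combinations concrete, the following is a minimal sketch of contiguous N-gram extraction from a tokenized sentence. It is not the ProtFus implementation: the function name `ngrams`, the whitespace tokenization, and the example sentence are assumptions chosen for illustration.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens, in reading order."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A sentence with k tokens yields k - n + 1 windows (none if n > k).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, not taken from the ProtFus corpus.
sentence = "BCR-ABL1 fusion protein interacts with GRB2"
tokens = sentence.split()  # naive whitespace tokenization (an assumption)

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for gram in bigrams:
    print(" ".join(gram))
```

Bigrams and trigrams of this kind let a phrase such as "fusion protein" be matched as a unit rather than as two unrelated words.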
Datasets considered for training (collected from PubMed between January 2013 and April 2017).
| PubMed Year | Abstracts | Full Texts | “Fusion proteins” | “Fusion proteins”+PPI |
|---|---|---|---|---|
| 2017 | 17220 | 5212 | 43 | 2 |
| 2016 | 352097 | 163884 | 1164 | 99 |
| 2015 | 353972 | 171432 | 1132 | 104 |
| 2014 | 321314 | 156091 | 1187 | 112 |
| 2013 | 299380 | 141512 | 1203 | 110 |
Datasets considered for testing ProtFus.
| PubMed Year | Abstracts | Full Texts | “Fusion proteins” | “Fusion proteins”+PPI |
|---|---|---|---|---|
| 2017 | 25830 | 7819 | 65 | 5 |
| 2016 | 528146 | 245826 | 1747 | 148 |
| 2015 | 530960 | 257148 | 1697 | 155 |
| 2014 | 481971 | 234136 | 1780 | 167 |
| 2013 | 449069 | 212268 | 1805 | 165 |
Bag-of-words collection for 10 PubMed ID abstracts.
| PMID | Fusion proteins | Fusion Gene | Biological Token | Miscellaneous Token |
|---|---|---|---|---|
| 24186139 | 1 | 1 | 20 | 35 |
| 22101766 | 0 | 1 | 25 | 30 |
| 18451133 | 0 | 1 | 28 | 38 |
| 11930009 | 1 | 1 | 26 | 32 |
| 15735689 | 0 | 1 | 21 | 34 |
| 18850010 | 0 | 0 | 27 | 33 |
| 21193423 | 1 | 0 | 23 | 33 |
| 22570737 | 1 | 0 | 30 | 38 |
| 18383210 | 1 | 0 | 29 | 35 |
| 24345920 | 1 | 0 | 26 | 32 |
| 16502585 | 1 | 0 | 21 | 33 |
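A bag-of-words row like those in the table above can be approximated with a few counters per abstract. The sketch below is illustrative only: the lexicon `BIOLOGICAL_TOKENS`, the regular expressions, and the function name `bag_of_words` are hypothetical, since the vocabularies and tokenization rules used by ProtFus are not given here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lexicon; a real one would be far larger.
BIOLOGICAL_TOKENS = {"kinase", "leukemia", "translocation"}


def bag_of_words(text: str) -> Counter:
    """Count fusion-related phrases and token categories in one abstract."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z0-9-]+", lowered)

    counts = Counter()
    # Phrase-level counts (singular or plural forms).
    counts["fusion_protein"] = len(re.findall(r"fusion proteins?\b", lowered))
    counts["fusion_gene"] = len(re.findall(r"fusion genes?\b", lowered))
    # Token-level counts: lexicon hits versus everything else.
    counts["biological"] = sum(1 for t in tokens if t in BIOLOGICAL_TOKENS)
    counts["miscellaneous"] = len(tokens) - counts["biological"]
    return counts


# Hypothetical abstract fragment, not one of the PMIDs listed above.
example = "The BCR-ABL1 fusion protein is a kinase encoded by a fusion gene in leukemia."
row = bag_of_words(example)
print(dict(row))
```

Each abstract then becomes one fixed-length count vector, which is the form a Naïve Bayes classifier typically consumes.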
Precision and Recall for retrieval step.
| Dataset | Precision | Recall | F-Score | Accuracy |
|---|---|---|---|---|
| Set A | 0.79 | 0.82 | 0.76 | 0.81 |
| Set B | 0.81 | 0.83 | 0.78 | 0.80 |
| Set C | 0.85 | 0.84 | 0.82 | 0.85 |
| Set D | 0.72 | 0.76 | 0.72 | 0.74 |
| Set E | 0.80 | 0.82 | 0.78 | 0.82 |
| Set F | 0.81 | 0.81 | 0.78 | 0.82 |
| Set G | 0.78 | 0.83 | 0.81 | 0.83 |
| Set H | 0.75 | 0.81 | 0.78 | 0.80 |
| Set I | 0.85 | 0.82 | 0.81 | 0.83 |
| Set J | 0.73 | 0.78 | 0.76 | 0.75 |
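For readers who want to recompute metrics of this kind, the sketch below derives precision, recall, F-score, and accuracy from confusion-matrix counts using the standard definitions. It assumes the F-score is the balanced F1 (the harmonic mean of precision and recall); the source does not state which F variant was reported. The counts in the example are hypothetical and are not taken from any of the sets above.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard metrics from true/false positive/negative counts."""
    total = tp + fp + fn + tn
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Balanced F1 (an assumption about which F-score is meant).
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under this definition F1 always lies between precision and recall, which is a quick sanity check when reading tabulated values.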
Precision and Recall for named-entity recognition.
| Dataset | Precision | Recall | F-Score | Accuracy |
|---|---|---|---|---|
| Set A | 0.79 | 0.82 | 0.77 | 0.81 |
| Set B | 0.77 | 0.82 | 0.80 | 0.82 |
| Set C | 0.87 | 0.83 | 0.82 | 0.89 |
| Set D | 0.80 | 0.81 | 0.76 | 0.78 |
| Set E | 0.84 | 0.83 | 0.82 | 0.82 |
| Set F | 0.81 | 0.84 | 0.83 | 0.83 |
| Set G | 0.81 | 0.89 | 0.85 | 0.84 |
| Set H | 0.82 | 0.82 | 0.84 | 0.80 |
| Set I | 0.82 | 0.84 | 0.83 | 0.87 |
| Set J | 0.78 | 0.80 | 0.77 | 0.79 |
Accuracy score of classifiers.
| Dataset | Precision | Recall | F-Score | Accuracy |
|---|---|---|---|---|
| Set A | 0.82 | 0.86 | 0.79 | 0.84 |
| Set B | 0.83 | 0.85 | 0.82 | 0.83 |
| Set C | 0.91 | 0.92 | 0.89 | 0.91 |
| Set D | 0.79 | 0.81 | 0.74 | 0.77 |
| Set E | 0.86 | 0.85 | 0.83 | 0.85 |
| Set F | 0.85 | 0.85 | 0.83 | 0.84 |
| Set G | 0.85 | 0.87 | 0.84 | 0.85 |
| Set H | 0.81 | 0.83 | 0.83 | 0.82 |
| Set I | 0.87 | 0.86 | 0.84 | 0.86 |
| Set J | 0.75 | 0.81 | 0.79 | 0.78 |
Performance of ProtFus compared to other resources.
| Resource | Full-Text | Extraction |
|---|---|---|
| ChimerDB-3.0 | Yes | 82% |
| FusionCancer (does not use text mining) | Yes | NA |
| FusionDB (does not use text mining) | Yes | NA |
| ProtFus | Yes | 92% |