Sonja Hatz1, Scott Spangler2, Andrew Bender3, Matthew Studham3, Philipp Haselmayer1, Alix M B Lacoste4, Van C Willis5, Richard L Martin5, Harsha Gurulingappa1, Ulrich Betz1.
Abstract
BACKGROUND: Pharmacodynamic biomarkers are becoming increasingly valuable for assessing drug activity and target modulation in clinical trials. However, identifying quality biomarkers is challenging due to the increasing volume and heterogeneity of relevant data describing the biological networks that underlie disease mechanisms. A biological pathway network typically includes entities (e.g., genes, proteins and chemicals/drugs) as well as the relationships between them, and is typically curated or mined from structured databases and textual co-occurrence data. We propose a hybrid Natural Language Processing and directed relationships-based network analysis approach using IBM Watson for Drug Discovery to rank all human genes and identify potential candidate biomarkers, requiring only an initial determination of a specific target-disease relationship.
Year: 2019 PMID: 30958864 PMCID: PMC6453528 DOI: 10.1371/journal.pone.0214619
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
References to biomedical relation extraction from text.
| Reference | Data set | Methods / Dictionaries | Benchmarking datasets | Biological question |
|---|---|---|---|---|
| van Haagen et al. 2009 | 12 million MEDLINE abstracts to July 2007 | ER, CO | BioGRID, DIP, HPRD, IntAct, MINT, Reactome, and UniProt/Swiss-Prot were used to establish a set of 61,807 known human PPIs. | Protein-protein interactions |
| Bravo et al. 2014 | MEDLINE abstracts annotated with MeSH terms “pharmacological biomarkers” and “biological markers”: 164,300 abstracts | ER (bioNER) | 686,172 co-occurrences between 2,803 biomarkers and 2,751 diseases | Biomarker-disease associations |
| Mihăilă, C. and Ananiadou, S. 2014 | BioCause corpus, a collection of 19 open-access full-text journal articles pertaining to the subdomain of infectious diseases | RE, CS | BioCause corpus (infectious disease) | Infectious disease |
| Ahlers et al. 2007 | Semantic MEDLINE | ER, CO, RE | Gold standard annotation of 300 sentences (selected by co-occurring drugs and genes) with 850 predictions; 55% recall and 73% precision | Pharmacodynamic effects of drugs |
| Ahmed et al. 2018 | AIMed and BioInfer benchmark datasets, which are subsets of PubMed articles annotated with protein-protein interactions | RE applied to train and evaluate a tree recurrent neural network architecture | Validation by 10-fold cross-validation over the AIMed and BioInfer datasets, which yielded F1-scores of 81% and 89%, respectively | Protein-protein interactions |
| Vlietstra et al. 2017 | Not defined (“triples from text and databases”) | ER | Systematic literature review of 234 studies: 163 of 222 compounds ranked in the top 2,000 of 51,000 extracted compounds | Diagnostic biomarkers for migraine in blood and CSF |
| Chang et al. 2017 | PubMed queries (keyword in title and abstract): 12,052 articles | ER, CO | Extracted 2,128 gene/protein biomarker candidates and compared them with several online resources (incl. Liverome, MarkerRIF, GeneCards, MalaCards, COSMIC). Comparison with HCC-related databases showed retrieval of between 20% (Liverome, omics data) and 50% (MarkerRIF, manually curated from literature) | Diagnostic biomarkers for hepatocellular carcinoma |
| Jurca et al. 2016 | MEDLINE abstracts, API search for “breast cancer”; used those abstracts that contained genes | ER (BeCAS), CO | Non-systematic: compared co-expressed genes (experimental data) in GeneMANIA (PPI from BioGRID and Pathway Commons) with a community formed by network analysis | Diagnostic biomarkers for breast cancer |
ER, entity recognition; CO, co-occurrence analysis; RE, relationship extraction; CS, cross-sentence references; PPI, protein-protein interaction; ROC, area under receiver operating characteristic curve; CSF, cerebrospinal fluid.
Fig 1. Overview of pipeline used for identification of putative BTK inhibition biomarkers.
PubMed abstracts were used as input to generate a working corpus, from which gene entities were identified, selected, and normalized to build a relationship network. The relationship network was compared with the MetaCore curated database and then analyzed by matrix factorization to predict potential BTK biomarkers; the predictions were compared with a list of potential BTK biomarkers compiled by subject matter experts.
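
The relationship-network step can be pictured with a small sketch. It is not the WDD implementation: the triples, the placeholder gene names (`GENE_A` to `GENE_D`), and the helper functions `build_network` and `downstream` are all hypothetical, and a plain breadth-first search stands in for whatever traversal or scoring the pipeline actually uses. It only illustrates how directed relationships extracted from text let one ask which genes lie downstream of a target such as BTK.

```python
from collections import defaultdict, deque

# Hypothetical directed triples (source, relation, target).
# GENE_A..GENE_D are placeholders, not findings from the paper.
TRIPLES = [
    ("BTK", "activates", "GENE_A"),
    ("GENE_A", "phosphorylates", "GENE_B"),
    ("GENE_C", "inhibits", "BTK"),
    ("GENE_B", "activates", "GENE_D"),
]


def build_network(triples):
    """Map each source entity to the set of entities it points to."""
    graph = defaultdict(set)
    for source, _relation, target in triples:
        graph[source].add(target)
    return graph


def downstream(graph, start):
    """Breadth-first search along edge direction.

    Returns {gene: number of directed steps from start}, excluding start.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


network = build_network(TRIPLES)
print(downstream(network, "BTK"))
```

Because edges are directed, `GENE_C` (which acts on BTK) is not reported as downstream, which is the distinction a plain co-occurrence count cannot make.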
Top 50 genes predicted by WDD to be downstream of BTK.
| Gene | Score | Known Downstream to BTK | Rank (including known) |
|---|---|---|---|
| | 0.54458815 | 1 | 1 |
| | 0.5154992 | 1 | 2 |
| | 0.50323606 | 0 | 3 |
| | 0.4434148 | 1 | 4 |
| | 0.4385939 | 1 | 5 |
| | 0.4313362 | 1 | 6 |
| | 0.42688292 | 1 | 7 |
| | 0.41607705 | 1 | 8 |
| | 0.41537482 | 1 | 9 |
| | 0.38802144 | 0 | 10 |
| | 0.3744676 | 1 | 11 |
| | 0.36648414 | 0 | 12 |
| | 0.35648954 | 1 | 13 |
| | 0.3489667 | 0 | 14 |
| | 0.34443948 | 1 | 15 |
| | 0.3414984 | 0 | 16 |
| | 0.33224836 | 0 | 17 |
| | 0.3214404 | 0 | 18 |
| | 0.31818953 | 0 | 19 |
| | 0.3179956 | 0 | 20 |
| | 0.31714404 | 1 | 21 |
| | 0.3144715 | 0 | 22 |
| | 0.31039178 | 1 | 23 |
| | 0.303452 | 0 | 24 |
| | 0.30185032 | 0 | 25 |
| | 0.30117995 | 0 | 26 |
| | 0.30093542 | 0 | 27 |
| | 0.29676566 | 1 | 28 |
| | 0.2944623 | 1 | 29 |
| | 0.28429946 | 1 | 30 |
| | 0.28102204 | 1 | 31 |
| | 0.28000763 | 1 | 32 |
| | 0.27899128 | 1 | 33 |
| | 0.27572146 | 0 | 34 |
| | 0.27440986 | 1 | 35 |
| | 0.27148065 | 1 | 36 |
| | 0.2702503 | 0 | 37 |
| | 0.2699324 | 1 | 38 |
| | 0.2696973 | 0 | 39 |
| | 0.26840165 | 0 | 40 |
| | 0.26694292 | 0 | 41 |
| | 0.26671404 | 1 | 42 |
| | 0.2651292 | 1 | 43 |
| | 0.26320416 | 0 | 44 |
| | 0.2610977 | 1 | 45 |
| | 0.26066443 | 0 | 46 |
| | 0.25447765 | 1 | 47 |
| | 0.25388432 | 0 | 48 |
| | 0.25309193 | 0 | 49 |
| | 0.25100806 | 0 | 50 |
Includes known genes (those already in WDD’s network). Full table of 13,595 ranked genes available in supplemental Table 1.
WDD matrix factorization ranking of known BTK targets.
| Gene | WDD Rank | Percentile |
|---|---|---|
| | 2240 | 16% |
| | 506 | 4% |
| | 124 | 1% |
| | 384 | 3% |
| | 1448 | 11% |
| | 379 | 3% |
| | 2775 | 20% |
| | 284 | 2% |
| | 997 | 7% |
| | 3 | 0.02% |
| | 10558 | 78% |
| | 9933 | 73% |
| | 654 | 5% |
Fig 2. Receiver operating characteristic curve of WDD-predicted versus known BTK interactions.
Resulting receiver operating characteristic curve from analysis comparing WDD-predicted BTK biomarker ranking to a subject matter expert-derived list of potential BTK biomarkers. Area under the curve = 0.82.
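
For a ranked list, the area under the ROC curve can be computed directly from the ranks of the positive genes. The sketch below uses the rank-sum (Mann-Whitney) identity; the function name is ours. Treating the 13 ranks from the known-targets table as the positives and every other ranked gene as a negative is an assumption made for illustration, and the source does not say that the reported value was obtained this way.

```python
def auc_from_positive_ranks(positive_ranks, total):
    """AUC for a ranked list in which rank 1 is the strongest prediction.

    Equals the fraction of (positive, negative) pairs in which the
    positive is ranked ahead of the negative. Assumes no tied ranks.
    """
    n_pos = len(positive_ranks)
    n_neg = total - n_pos
    # For the i-th best positive (1-based), rank - i negatives sit ahead of it;
    # summing over positives gives sum(ranks) - n_pos * (n_pos + 1) / 2.
    negatives_ahead = sum(positive_ranks) - n_pos * (n_pos + 1) // 2
    return 1.0 - negatives_ahead / (n_pos * n_neg)


# Ranks of the known BTK targets as listed in the WDD ranking table.
KNOWN_TARGET_RANKS = [2240, 506, 124, 384, 1448, 379, 2775,
                      284, 997, 3, 10558, 9933, 654]

print(round(auc_from_positive_ranks(KNOWN_TARGET_RANKS, 13_595), 3))
```

Under these assumptions the result lands close to the reported 0.82. Any difference could come from a different positive set or from how ties and unranked genes were handled.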
Co-occurrence ranking of known BTK targets.
| Gene | Abstracts | Rank |
|---|---|---|
| | 49 | 10 |
| | 9 | 84 |
| | 4 | 186 |
| | 3 | 241 |
| | 3 | 241 |
| | 1 | 684 |
| | 1 | 684 |
| | 0 | 7272 |
| | 0 | 7272 |
| | 0 | 7272 |
| | 0 | 7272 |
| | 0 | 7272 |
| | 0 | 7272 |
Analysis of similar STAT3 matrix rows.
| Similar Gene | # of shared connections to STAT3 | Total Connections | P value |
|---|---|---|---|
| | 273 | 446 | 6.22E-17 |
| | 102 | 145 | 1.70E-11 |
| | 154 | 269 | 1.36E-06 |
| | 218 | 409 | 1.12E-05 |
| | 100 | 173 | 8.04E-05 |
| | 99 | 173 | 1.54E-04 |
| | 88 | 156 | 7.59E-04 |
| | 160 | 311 | 0.00240641 |
| | 181 | 358 | 0.00343664 |
| | 231 | 468 | 0.00430228 |
| | 107 | 202 | 0.00449299 |
| | 36 | 59 | 0.00594752 |
| | 85 | 159 | 0.00865486 |
a: P values were calculated using a chi-squared test comparing the expected number of shared connections (based on individual frequencies) with the actual number of shared connections.
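
One way to read the footnote is as a test of independence on a 2x2 table: if two genes connect to partners independently, the expected number of shared connections is the product of their individual connection frequencies times the number of possible partners. The sketch below implements a Pearson chi-squared test with one degree of freedom and no continuity correction. The exact test variant, STAT3's total connection count, and the size of the gene universe are not given in the source, so the function name and all numbers in the example are hypothetical.

```python
import math


def shared_connection_test(shared, total_a, total_b, universe):
    """Chi-squared test of independence for shared network connections.

    shared   : partners connected to both gene A and gene B
    total_a  : partners connected to gene A
    total_b  : partners connected to gene B
    universe : number of possible partners
    Returns (expected_shared, chi2, p_value).
    """
    expected = total_a * total_b / universe
    # Cells of the 2x2 contingency table.
    both = shared
    only_a = total_a - shared
    only_b = total_b - shared
    neither = universe - total_a - total_b + shared
    numerator = universe * (both * neither - only_a * only_b) ** 2
    denominator = (total_a * (universe - total_a)
                   * total_b * (universe - total_b))
    chi2 = numerator / denominator
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected, chi2, p_value


# Hypothetical numbers, not taken from the table.
expected, chi2, p_value = shared_connection_test(
    shared=40, total_a=200, total_b=100, universe=1000)
print(expected, round(chi2, 2), p_value)
```

In this toy case 20 shared connections would be expected by chance and 40 are observed, so the statistic is large and the p value small. When the observed count equals the expected count, the statistic is 0 and the p value is 1.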