Dong Xu, Meizhuo Zhang, Yanping Xie, Fan Wang, Ming Chen, Kenny Q Zhu, Jia Wei.
Abstract
MOTIVATION: Biomedical researchers often search through massive catalogues of literature to look for potential relationships between genes and diseases. Given the rapid growth of biomedical literature, automatic relation extraction, a crucial technology in biomedical literature mining, has shown great potential to support research of gene-related diseases. Existing work in this field has produced datasets that are limited both in scale and accuracy.
Year: 2016 PMID: 27506226 PMCID: PMC5181534 DOI: 10.1093/bioinformatics/btw503
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Features extracted for association detection
| Feature | Type | Description |
|---|---|---|
| 1 | Local lexical feature | Lemmas of the two words in front of the gene term and the two words behind the gene term, and lemmas of the two words in front of the disease term and the two words behind the disease term |
| 2 | Global syntactic feature | Unigram, bigram and trigram of lemmas on the shortest path between the gene and the disease terms in the dependency tree |
| 3 | Global syntactic feature | Unigram, bigram and trigram of lemmas on the path between the LCA of the gene and the disease terms and the root of the dependency tree |
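
The three feature groups in the table can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dependency tree is assumed to be given as a list of head indices (`None` for the root), the helper names (`window_lemmas`, `path_to_root`, `shortest_path`, `ngrams`) are hypothetical, the toy sentence and its parse are made up for illustration, and including the gene and disease tokens themselves on the shortest path is an assumption the table does not settle.

```python
from typing import List, Optional, Tuple


def window_lemmas(lemmas: List[str], idx: int, k: int = 2) -> List[str]:
    """Feature 1: lemmas of up to k tokens before and k tokens after position idx."""
    return lemmas[max(0, idx - k):idx] + lemmas[idx + 1:idx + 1 + k]


def path_to_root(heads: List[Optional[int]], i: int) -> List[int]:
    """Token indices from token i up to the root, both inclusive."""
    path = [i]
    head = heads[i]
    while head is not None:
        path.append(head)
        head = heads[head]
    return path


def shortest_path(heads: List[Optional[int]], a: int, b: int) -> Tuple[List[int], int]:
    """Shortest tree path from a to b (via their lowest common ancestor) and the LCA index."""
    up_a, up_b = path_to_root(heads, a), path_to_root(heads, b)
    ancestors_b = set(up_b)
    lca = next(n for n in up_a if n in ancestors_b)
    left = up_a[:up_a.index(lca) + 1]       # a ... LCA
    right = up_b[:up_b.index(lca)][::-1]    # node below LCA ... b
    return left + right, lca


def ngrams(tokens: List[str], max_n: int = 3) -> List[str]:
    """Unigrams, bigrams and trigrams (space-joined) over a token sequence."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


# Toy, hand-written parse (hypothetical): "MAOA confers risk for ADHD"
lemmas = ["MAOA", "confer", "risk", "for", "ADHD"]
heads = [1, None, 1, 4, 2]   # head index of each token; None marks the root
gene, disease = 0, 4

sdp, lca = shortest_path(heads, gene, disease)
feature_1 = window_lemmas(lemmas, gene) + window_lemmas(lemmas, disease)
feature_2 = ngrams([lemmas[i] for i in sdp])                      # shortest-path n-grams
feature_3 = ngrams([lemmas[i] for i in path_to_root(heads, lca)])  # LCA-to-root n-grams
```

In a real pipeline the lemmas and head indices would come from a dependency parser, and the resulting strings would typically be turned into sparse indicator features for a classifier.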
Fig. 1. Workflow of extracting gene–disease associations from MEDLINE
Fig. 2. Part of the dependency tree of the sentence ‘All three complementary approaches employed (family-based, case-control and quantitative trait design) suggests a role for the MAO A promoter-region polymorphism in conferring risk for ADHD in our patient population’
Fig. 3. Simplified paper citation network
Results of extracted associations compared to BeFree
| Framework | Associations | Genes | Diseases |
|---|---|---|---|
| DTMiner | | | |
| BeFree | 131 012 | 2803 | 2751 |
Bold text signifies the best performer in the column.
Results of gene/disease recognition
| Method | Precision | Recall | F-score |
|---|---|---|---|
| ABNER | 0.593 | 0.549 | 0.57 |
| Only dictionary | 0.839 | 0.659 | 0.738 |
| Stanford NER tool | 0.524 | 0.673 | |
| Before enriched by Bing | 0.851 | 0.875 | 0.863 |
| After enriched by Bing | 0.87 |
Bold text signifies the best performer in the column.
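
The F-score column can be related to the other two columns with a small helper. This sketch assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall), which the page does not state explicitly; the function name `f_score` is illustrative. The three complete rows of the table above are consistent with that assumption after rounding to three decimals.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows of the recognition table that report all three values.
abner = round(f_score(0.593, 0.549), 3)
dictionary_only = round(f_score(0.839, 0.659), 3)
before_bing = round(f_score(0.851, 0.875), 3)
```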
Results of gene/disease relation extraction
| Feature | Precision | Recall | F-score |
|---|---|---|---|
| Local Lexical Features | 0.761 | 0.748 | 0.755 |
| Global Syntactic Features | 0.827 | 0.853 | 0.839 |
| Local+Global features | | | |
Bold text signifies the best performer in the column.
Comparison of relation extraction performance
| Framework | F-score | Training Time (s) | Testing Time (s) |
|---|---|---|---|
| DTMiner | 0.863 | | |
| BeFree | | 384 | 4393 |
Bold text signifies the best performer in the column.
MRR results of different ranking methods
| Disease | Frequency-based | PageRank-based | Suppressed PageRank | Weighted PageRank |
|---|---|---|---|---|
| Retinitis pigmentosa | 0.111 | 0.141 | 0.146 | 0.156 |
| Adrenal gland chromaffinoma | 0.161 | 0.194 | 0.212 | 0.207 |
| Bipolar I disorder | 0.26 | 0.268 | 0.273 | 0.26 |
| Hyperlipidemia | 0.267 | 0.279 | 0.325 | 0.325 |
| Papilloma | 0.183 | 0.24 | 0.321 | 0.226 |
| Thrombocytopenia | 0.089 | 0.072 | 0.063 | 0.063 |
| Glioblastoma | 0.278 | 0.3 | 0.346 | 0.274 |
| Hernia diaphragmatic | 0.026 | 0.028 | 0.029 | 0.027 |
| Brain ischemia | 0.052 | 0.076 | 0.067 | 0.06 |
| Cerebrovascular accident | 0.178 | 0.178 | 0.147 | 0.175 |
| Overall | 0.161 | 0.178 | 0.193 | 0.177 |
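
For readers unfamiliar with the metric, a mean reciprocal rank over one disease's ranked gene list can be sketched as follows. The page does not spell out the exact MRR variant used, so this is one common reading: average 1/rank over the known (gold-standard) genes, with genes absent from the ranking contributing 0. The function name and the gene identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def mean_reciprocal_rank(ranked: List[str], relevant: Iterable[str]) -> float:
    """Average of 1/rank over the relevant items; unranked items contribute 0."""
    position: Dict[str, int] = {}
    for rank, item in enumerate(ranked, start=1):
        position.setdefault(item, rank)   # keep the best (first) rank of each item
    relevant = list(relevant)
    if not relevant:
        return 0.0
    return sum(1.0 / position[g] for g in relevant if g in position) / len(relevant)


# Hypothetical ranking for a single disease.
ranking = ["G1", "G2", "G3", "G4"]
gold = ["G1", "G4"]
mrr = mean_reciprocal_rank(ranking, gold)   # (1/1 + 1/4) / 2
```

A higher value means the known genes sit closer to the top of the list, which is how the four ranking methods in the table are compared.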
F-score of top-K of the rankings
| Top K | Freq-Based | PR-Based | Weighted PR | Sup-PR | BeFree | CoPub |
|---|---|---|---|---|---|---|
| K = 50 | 0.221 | 0.228 | 0.237 | 0.213 | 0.212 | |
| K = 100 | 0.236 | 0.235 | 0.230 | 0.214 | 0.211 | |
| K = 150 | 0.22 | 0.216 | 0.216 | 0.186 | 0.192 | |
| K = 200 | 0.196 | 0.196 | 0.197 | 0.167 | 0.175 |
Fig. 4. F-score of Top-K Rankings
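
A top-K F-score of this kind can be computed by truncating a ranked list at K and comparing it with a gold-standard set. The sketch below assumes precision is hits/K, recall is hits/|gold|, and the F-score is their harmonic mean; the page does not confirm these details, and the function name and gene identifiers are hypothetical.

```python
from typing import List, Set


def f_score_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """F1 of the top-k ranked items measured against a gold-standard set."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in gold)
    if hits == 0 or not top_k or not gold:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranking and gold standard.
ranking = ["G1", "G2", "G3", "G4"]
gold = {"G1", "G3", "G5"}
f_at_2 = f_score_at_k(ranking, gold, 2)   # precision 1/2, recall 1/3
```

As K grows, recall can only rise while precision usually falls, which is why the F-score in such curves typically peaks at an intermediate K.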