Komandur Elayavilli Ravikumar, Kavishwar B Wagholikar, Dingcheng Li, Jean-Pierre Kocher, Hongfang Liu.
Abstract
BACKGROUND: Advances in next-generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their associations with other entities (e.g., diseases or medications). Despite dedicated efforts to capture such information in biological databases, much of it remains 'locked' in the unstructured text of biomedical publications, and there is a substantial lag between publication and the abstraction of such information into databases. Multiple text mining systems have been developed, but most focus on sentence-level association extraction, with performance evaluated against gold standard text annotations prepared specifically for text mining systems.
Year: 2015 PMID: 26047637 PMCID: PMC4457984 DOI: 10.1186/s12859-015-0609-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Role of human inference in database curation
Fig. 2 System architecture
Fig. 3 Steps in dependency parse graph traversal
Fig. 4 Linking dependency graphs on entity identity
Fig. 5 Linking dependency graphs based on discourse analysis
Fig. 6 Experimental design
Fig. 7 Extraction of gold standard from UniProtKB
Gold standard relation statistics (Development and test data set)
| S. No | Relation | Data set | Total numbers |
|---|---|---|---|
| 1 | Protein-Mutation-Disease (PMD) | D | 631 |
| | | T | 264 |
| 2 | Protein-Mutation (PM) | D | 879 |
| | | T | 388 |
| 3 | Protein-Disease (PD) | D | 671 |
| | | T | 295 |
D – Development set; T – Test set
Evaluation results on development (D) and test (T) data sets
| Sys | Set | PMD P | PMD R | PMD F | PM P | PM R | PM F | MD P | MD R | MD F | PD P | PD R | PD F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | D | 60.0 | 69.3 | 64.3 | 69.2 | 68.4 | 68.8 | 61.7 | 67.7 | 64.6 | 67.2 | 80.6 | 73.3 |
| | T | 52.6 | 72.0 | 60.8 | 65.5 | 71.4 | 68.3 | 57.0 | 70.9 | 63.2 | 61.1 | 80.9 | 69.6 |
| S2 | D | 76.2 | 39.0 | 51.6 | 82.6 | 48.1 | 48.1 | 74.8 | 44.3 | 55.6 | 84.6 | 59.8 | 70.0 |
| | T | 77.3 | 41.3 | 53.8 | 76.3 | 43.0 | 55.0 | 67.8 | 43.6 | 53.1 | 74.7 | 61.4 | 67.4 |
| S3 | D | 77.1 | 36.3 | 49.7 | 84.4 | 45.6 | 59.2 | 78.2 | 42.5 | 55.0 | 89.3 | 57.4 | 69.9 |
| | T | 78.7 | 36.4 | 49.7 | 79.1 | 38.9 | 52.2 | 77.2 | 41.8 | 54.3 | 76.7 | 59.7 | 67.2 |
| S4 | D | 76.4 | 52.3 | 60.3 | 84.4 | 45.6 | 59.2 | 78.2 | 42.5 | 55.0 | 89.3 | 57.4 | 69.9 |
| | T | 75.8 | 52.3 | 61.9 | 79.1 | 38.9 | 52.2 | 77.2 | 41.8 | 54.3 | 76.7 | 59.7 | 67.2 |
| S5 | D | 75.8 | 59.6 | 66.7 | 81.7 | 57.0 | 67.2 | 75.9 | 60.0 | 67.0 | 88.6 | 63.8 | 74.2 |
| | T | 71.6 | 58.3 | 64.3 | 76.8 | 58.0 | 67.2 | 74.8 | 59.3 | 66.1 | 76.2 | 67.7 | 71.7 |
Sys – Systems; PMD – Protein-Mutation-Disease relationships; PM – Protein-Mutation relationships; MD – Mutation-Disease relationships; PD – Protein-Disease relationships; S1 (System1) – Abstract-level co-occurrence; S2 (System2) – Sentence-level co-occurrence; S3 (System3) – Sentence-level dependency-graph-based traversal; S4 (System4) – Linking two dependency graphs based on entity identity; S5 (System5) – Linking two or more graphs based on anaphora resolution/trigger words; P – Precision (in %); R – Recall (in %); F – F-measure (in %)
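The legend does not say which F-measure variant is reported. Assuming it is the balanced F-measure (F1, the harmonic mean of precision and recall), the sketch below reproduces the S1 values reported for the PMD relation. The helper name `f_measure` is hypothetical and is not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the table reports F1; the source does not state the variant.
    Inputs and output are in the same units (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# S1 (abstract-level co-occurrence), PMD relation: (P, R) copied from the table above.
s1_pmd = {"D": (60.0, 69.3), "T": (52.6, 72.0)}

for split, (p, r) in s1_pmd.items():
    print(split, round(f_measure(p, r), 1))
# D 64.3
# T 60.8
```

Both values agree with the F column for S1 in the table, which is consistent with (but does not prove) the F1 assumption.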
Fig. 8 Performance trend of systems on development and test data set
Fig. 9 Comparison of performance of systems on test data
Fig. 10 Percentage distribution of precision errors on test data set
Fig. 11 Percentage distribution of recall errors on test data set