Andre Lamurias, João D Ferreira, Francisco M Couto.
Abstract
BACKGROUND: Our approach to the BioCreative IV challenge of recognition and classification of drug names (CHEMDNER task) aimed at achieving high levels of precision by applying semantic similarity validation techniques to Chemical Entities of Biological Interest (ChEBI) mappings. Our assumption is that the chemical entities mentioned in the same fragment of text should share some semantic relation. This validation method was further improved by adapting the semantic similarity measure to take into account the h-index of each ancestor. We applied this adaptation to two measures, simUI and simGIC, and validated the results obtained for the competition, comparing each adapted measure to its original version.
Keywords: ChEBI; Named Entity Recognition; Ontologies; Semantic Similarity
Year: 2015 PMID: 25810770 PMCID: PMC4331689 DOI: 10.1186/1758-2946-7-S1-S13
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
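The h-index adaptation described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions of simUI (the ratio of shared ancestors to all ancestors of two terms) and simGIC (the same ratio, weighted by information content), and it assumes that the adaptation amounts to discarding ancestors whose h-index falls below a threshold before computing the measure. The term names, h-index values, information content values and the helper `filter_by_h_index` are hypothetical and chosen only for illustration; they are not taken from ChEBI or from the authors' implementation.

```python
from typing import Dict, Set


def sim_ui(anc_a: Set[str], anc_b: Set[str]) -> float:
    """Shared ancestors divided by all ancestors of the two terms."""
    union = anc_a | anc_b
    if not union:
        return 0.0
    return len(anc_a & anc_b) / len(union)


def sim_gic(anc_a: Set[str], anc_b: Set[str], ic: Dict[str, float]) -> float:
    """Like sim_ui, but each ancestor is weighted by its information content."""
    total = sum(ic[t] for t in anc_a | anc_b)
    if total == 0:
        return 0.0
    return sum(ic[t] for t in anc_a & anc_b) / total


def filter_by_h_index(ancestors: Set[str], h_index: Dict[str, int], min_h: int) -> Set[str]:
    """Assumed adaptation: keep only ancestors with h-index >= min_h."""
    return {t for t in ancestors if h_index.get(t, 0) >= min_h}


# Hypothetical toy ontology data (not real ChEBI terms).
anc_a = {"root", "A", "B", "C"}
anc_b = {"root", "A", "D"}
h_index = {"root": 5, "A": 3, "B": 1, "C": 2, "D": 0}
ic = {"root": 0.0, "A": 1.0, "B": 2.0, "C": 2.0, "D": 3.0}

# Original measures on the full ancestor sets.
ui = sim_ui(anc_a, anc_b)
gic = sim_gic(anc_a, anc_b, ic)

# Adapted measures: ancestors with h-index below 2 are discarded first.
fa = filter_by_h_index(anc_a, h_index, min_h=2)
fb = filter_by_h_index(anc_b, h_index, min_h=2)
ui2 = sim_ui(fa, fb)
gic2 = sim_gic(fa, fb, ic)

print(ui, gic, ui2, gic2)
```

In this toy example, removing the low h-index ancestors raises both similarity values, because the discarded terms were all unshared ones. Whether a term counts as its own ancestor is left open here and would depend on the implementation.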
Precision (P), Recall (R) and F-measure (F) estimates for each method used, using cross-validation, for the Chemical Documents Indexing task (CDI) and Chemical Entity Mention task (CEM).
| Run | CDI P | CDI R | CDI F | CEM P | CEM R | CEM F |
|---|---|---|---|---|---|---|
| 1 | 84.1% | 72.6% | 77.9% | 87.3% | 70.2% | 77.8% |
| 2 | 95.0% | 6.5% | 12.2% | 95.0% | 5.9% | 11.1% |
| 3 | 52.1% | 80.4% | 63.3% | 57.1% | 76.6% | 65.4% |
| 3* | 76.7% | 75.7% | 76.2% | 80.2% | 72.8% | 76.3% |
| 4 | 87.9% | 22.7% | 36.1% | 89.7% | 21.2% | 34.3% |
| 5 | 87.8% | 22.7% | 36.1% | 79.9% | 22.6% | 35.3% |
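As a reading aid for the table above, the F column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the page itself does not state the formula) and recomputes the CDI value of run 1 from its reported P and R; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Run 1, CDI task: P = 84.1%, R = 72.6% (values taken from the table).
f_run1 = f_measure(0.841, 0.726)
print(f"{f_run1:.1%}")  # prints 77.9%
```

The same calculation shows why run 2 scores so low on F despite its 95.0% precision: with recall at 6.5%, the harmonic mean is dominated by the smaller value.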
Precision (P), Recall (R) and F-measure (F) obtained with the test set, for the Chemical Documents Indexing task (CDI) and Chemical Entity Mention task (CEM).
| Run | CDI P | CDI R | CDI F | CEM P | CEM R | CEM F |
|---|---|---|---|---|---|---|
| 1 | 85.3% | 68.9% | 76.2% | 87.8% | 65.2% | 74.8% |
| 2 | 96.8% | 8.06% | 14.9% | 96.7% | 7.11% | 13.3% |
| 3 | 57.7% | 81.5% | 67.5% | 63.9% | 77.9% | 70.2% |
| 4 | 91.9% | 24.4% | 38.6% | 92.9% | 22.7% | 36.4% |
| 5 | 77.1% | 27.3% | 40.3% | 79.7% | 25.0% | 38.1% |
Figure 1. Average percentage of ancestors discarded using each h-index value.
Precision values obtained with each SSM for a fixed recall.
| SSM | P | R |
|---|---|---|
| simUI | 92.97% | 20.31% |
| simUI2 | 93.14% | 20.23% |
| simUI3 | 93.01% | 19.73% |
| simUI4 | 93.10% | 19.77% |
| simUI5 | 93.35% | 19.81% |
| simUI6 | 93.00% | 20.16% |
| simGIC | 92.95% | 20.23% |
| simGIC2 | 93.14% | 20.23% |
| simGIC3 | 93.23% | 19.85% |
| simGIC4 | 93.24% | 20.09% |
| simGIC5 | 93.19% | 20.10% |
| simGIC6 | 93.10% | 19.79% |
Figure 2. Comparison of precision and recall values for different thresholds between simUI and simGIC and variants with h-index ≥ 2.
Figure 3. Comparison of precision and recall values for different thresholds between simUI and simGIC and variants with h-index ≥ 3.
Figure 4. Comparison of precision and recall values for different thresholds between simUI and simGIC and variants with h-index ≥ 4.
Figure 5. Comparison of precision and recall values for different thresholds between simUI and simGIC and variants with h-index ≥ 5.
Figure 6. Comparison of precision and recall values for different thresholds between simUI and simGIC and variants with h-index ≥ 6.
Number of chemical entities from the CHEMDNER corpus not mapped to ChEBI.
| Type | Systematic | Identifier | Formula | Trivial | Abbreviation | Family | Multiple |
|---|---|---|---|---|---|---|---|
| Unmapped | 3382 | 1156 | 3972 | 3622 | 4181 | 1690 | 91 |
| Unmapped (%) | 25.1% | 88.2% | 46.3% | 20.3% | 46.2% | 20.3% | 23.3% |
| Total | 13472 | 1311 | 8585 | 17802 | 9059 | 8313 | 390 |
Corpora and validation methods used for each run.
| Run | Corpora: CHEMDNER | Corpora: DDI/PAT | Validation: SSM | Validation: COMBINED | Validation: RF |
|---|---|---|---|---|---|
| 1 | X | X | X | ||
| 2 | X | X | |||
| 3 | X | X | |||
| 3* | X | ||||
| 4 | X | X | |||
| 5 | X | X | X | ||
Figure 7. Section of the ChEBI ontology showing a term (CHEBI:24346) with an h-index of 2, since 2 of its child nodes have at least 2 child nodes of their own, and the other child node has no more than 2 child nodes.
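The h-index of an ontology term, as described in the caption above, can be sketched as follows: a term has h-index h when h is the largest number such that at least h of its children have at least h children each. The `children` mapping and the term names below are hypothetical and do not reproduce the actual ChEBI subtree shown in the figure; the function name `term_h_index` is likewise illustrative.

```python
from typing import Dict, List


def term_h_index(term: str, children: Dict[str, List[str]]) -> int:
    """Largest h such that at least h children of `term` have >= h children."""
    # Number of children of each direct child, largest first.
    counts = sorted(
        (len(children.get(child, [])) for child in children.get(term, [])),
        reverse=True,
    )
    h = 0
    for rank, n_children in enumerate(counts, start=1):
        if n_children >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical subtree: the parent has three children; two of them have
# two children each and the third has only one.
children = {
    "parent": ["c1", "c2", "c3"],
    "c1": ["c1a", "c1b"],
    "c2": ["c2a", "c2b"],
    "c3": ["c3a"],
}

parent_h = term_h_index("parent", children)
print(parent_h)  # prints 2
```

Leaf terms, and terms whose children are all leaves, get an h-index of 0 under this definition.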