Kristina M Hettne, Antony J Williams, Erik M van Mulligen, Jos Kleinjans, Valery Tkachenko, Jan A Kors.
Abstract
BACKGROUND: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text, based on a number of publicly available databases, and tested it on an annotated corpus. To achieve acceptable recall and precision, we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.
Year: 2010 PMID: 20331846 PMCID: PMC2848622 DOI: 10.1186/1758-2946-2-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Precision (P), recall (R) and F-score (F) of the dictionaries on the annotated corpus.
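| Dictionary | Unprocessed P | Unprocessed R | Unprocessed F | Filtered P | Filtered R | Filtered F | Frequent terms correction P | Frequent terms correction R | Frequent terms correction F | Disambiguation P | Disambiguation R | Disambiguation F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|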
| ChemSpider | 0.43 | 0.19 | 0.26 | 0.81 | 0.19 | 0.31 | 0.85 | 0.19 | 0.31 | 0.87 | 0.19 | 0.31 |
| Chemlist | 0.20 | 0.47 | 0.28 | 0.39 | 0.46 | 0.42 | 0.55 | 0.46 | 0.50 | 0.67 | 0.40 | 0.50 |
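The F column can be checked against the P and R columns. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the record does not state the formula, so treat that as an assumption. The function name `f_score` and the dictionary `final_step` are illustrative names, and the values are copied from the Disambiguation columns of the table.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed formula)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was found
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs after the disambiguation step, copied from the table.
final_step = {
    "ChemSpider": (0.87, 0.19),
    "Chemlist": (0.67, 0.40),
}

for name, (p, r) in final_step.items():
    print(f"{name}: F = {f_score(p, r):.2f}")
```

Because the harmonic mean is dominated by the smaller of the two values, ChemSpider's high precision cannot compensate for its low recall, which is why its F-score stays below that of Chemlist.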
Recall values for the entity classes per dictionary.
| Entity class | ChemSpider | Chemlist |
|---|---|---|
| IUPAC (391) | 0.08 | 0.21 |
| PART (92) | 0.00 | 0.04 |
| SUM (49) | 0.25 | 0.29 |
| TRIV (414) | 0.45 | 0.80 |
| ABB (161) | 0.01 | 0.22 |
| FAM (99) | 0.02 | 0.19 |
IUPAC: multiword systematic names, PART: partial chemical names, SUM: sum formulas, TRIV: trivial names (including single word IUPAC names), ABB: abbreviations, FAM: chemical family names.
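The per-class recall values can be combined into a single overall figure by weighting each class by its size. The sketch below assumes that the numbers in parentheses are the counts of annotated entities per class, which the record does not state explicitly; `class_counts`, `recall_by_class`, and `weighted_recall` are illustrative names. Since the per-class values are rounded to two decimals, the result is only approximate.

```python
# Assumed meaning: number of annotated entities per class (the parenthesised values).
class_counts = {"IUPAC": 391, "PART": 92, "SUM": 49, "TRIV": 414, "ABB": 161, "FAM": 99}

# Per-class recall, copied from the table.
recall_by_class = {
    "ChemSpider": {"IUPAC": 0.08, "PART": 0.00, "SUM": 0.25, "TRIV": 0.45, "ABB": 0.01, "FAM": 0.02},
    "Chemlist": {"IUPAC": 0.21, "PART": 0.04, "SUM": 0.29, "TRIV": 0.80, "ABB": 0.22, "FAM": 0.19},
}


def weighted_recall(recalls: dict, counts: dict) -> float:
    """Count-weighted average of per-class recall values."""
    total = sum(counts.values())
    return sum(recalls[cls] * counts[cls] for cls in counts) / total


for name, recalls in recall_by_class.items():
    print(f"{name}: overall recall = {weighted_recall(recalls, class_counts):.2f}")
```

Under these assumptions the weighted averages come out at roughly 0.19 for ChemSpider and 0.40 for Chemlist. The large TRIV and IUPAC classes dominate the result, so the gap in trivial-name recall accounts for most of the difference between the two dictionaries.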
Error analysis of a random sample of max 25 false negatives from each class for ChemSpider (CS) and Chemlist (CL).
| Error type | TRIV (CS) | TRIV (CL) | SUM (CS) | SUM (CL) | IUPAC (CS) | IUPAC (CL) | FAM (CS) | FAM (CL) | ABB (CS) | ABB (CL) |
|---|---|---|---|---|---|---|---|---|---|---|
| Partial match | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Annotation error | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| Not in dictionary | 25 | 15 | 22 | 16 | 25 | 24 | 25 | 24 | 25 | 8 |
| Removed by disambiguation | 0 | 5 | 0 | 7 | 0 | 0 | 0 | 1 | 0 | 12 |
| Removed by manual check of highly frequent terms | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| Tokenization error | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
Error analysis of the false positives (percentage) for ChemSpider and Chemlist.
| Error type | False positives (ChemSpider) | False positives (Chemlist) |
|---|---|---|
| Partial match | 21 (64%) | 96 (41%) |
| Annotation error | 11 (33%) | 29 (13%) |
| Out of corpus scope | 1 (3%) | 79 (34%) |
| Not a chemical | 0 | 28 (12%) |