Tiago Grego, Catia Pesquita, Hugo P. Bastos, Francisco M. Couto.
Abstract
Chemical entities are ubiquitous throughout the biomedical literature, and the development of text-mining systems that can efficiently identify those entities is required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2-5% for the entity-resolution task, and 15% for the combined entity recognition and resolution task.
Year: 2012 PMID: 25937941 PMCID: PMC4393067 DOI: 10.5402/2012/619427
Source DB: PubMed Journal: ISRN Bioinform ISSN: 2090-7338
Figure 1. Example of mapping enrichment in the corpus. The first line shows the original corpus entities, one of which was not mapped to ChEBI. The second line shows the corpus after enrichment, where that entity could be mapped.
Evaluation of entity recognition on the full gold standard of 18,061 chemical entities. Results of named entity recognition for each assessment and method are shown in this table. The dictionary method recognized a total of 18,683 entities, while the machine-learning method recognized 13,832 entities. True positives (TP) is the number of entity recognitions that agree with the gold standard for each assessment. Values of precision, recall, and F-measure are presented.
| Assessment | Method | TP | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| Exact matching | Dictionary | 5,868 | 31.41 | 32.49 | 31.94 |
| | Machine learning | 9,094 | 65.76 | 50.35 | 57.03 |
| Left matching | Dictionary | 6,868 | 36.76 | 38.03 | 37.38 |
| | Machine learning | 9,892 | 71.53 | 54.77 | 62.04 |
| Right matching | Dictionary | 8,015 | 42.90 | 44.38 | 43.63 |
| | Machine learning | 10,419 | 75.34 | 57.69 | 65.34 |
| Left/right matching | Dictionary | 9,015 | 48.25 | 49.91 | 49.07 |
| | Machine learning | 11,217 | 81.11 | 62.11 | 70.35 |
| Partial matching | Dictionary | 12,780 | 68.40 | 70.76 | 69.56 |
| | Machine learning | 12,328 | 89.15 | 68.26 | 77.32 |
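The precision, recall, and F-measure columns in these tables follow the standard definitions: TP over the entities a method recognized, TP over the entities in the gold standard, and their harmonic mean. A minimal sketch, using the exact-matching dictionary row above as a check:

```python
def metrics(tp: int, recognized: int, gold: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) as percentages."""
    precision = 100.0 * tp / recognized   # TP over all entities the method recognized
    recall = 100.0 * tp / gold            # TP over all entities in the gold standard
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

# Exact-matching dictionary row: 5,868 TP of 18,683 recognized, 18,061 gold entities.
p, r, f = metrics(tp=5868, recognized=18683, gold=18061)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # → 31.41 32.49 31.94
```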
Evaluation of entity recognition on the subset of the gold standard composed of 9,696 chemical entities that have a mapping to ChEBI. Results of named entity recognition for each assessment and method are shown in this table. The dictionary method recognized and mapped a total of 18,683 entities, while the machine-learning method recognized and mapped 10,681 entities. True positives (TP) is the number of entity recognitions that agree with the gold standard for each assessment. Values of precision, recall, and F-measure are presented.
| Assessment | Method | TP | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| Exact matching | Dictionary | 5,651 | 30.25 | 58.28 | 39.83 |
| | Machine learning | 5,830 | 54.60 | 60.13 | 57.23 |
| Left matching | Dictionary | 5,913 | 31.65 | 60.98 | 41.67 |
| | Machine learning | 6,084 | 56.98 | 62.75 | 59.73 |
| Right matching | Dictionary | 6,158 | 32.96 | 63.51 | 43.40 |
| | Machine learning | 5,948 | 55.70 | 61.34 | 58.38 |
| Left/right matching | Dictionary | 6,435 | 34.44 | 66.37 | 45.35 |
| | Machine learning | 6,307 | 59.07 | 65.05 | 61.92 |
| Partial matching | Dictionary | 7,654 | 40.97 | 78.94 | 53.94 |
| | Machine learning | 6,703 | 62.78 | 69.13 | 65.80 |
Evaluation of entity identification on the subset of the gold standard composed of 9,696 chemical entities that have a mapping to ChEBI. Results of entity identification (named entity recognition and resolution) for each assessment and method are shown in this table. The dictionary method recognized and mapped a total of 18,683 entities, while the machine-learning method recognized and mapped 10,681 entities. True positives (TP) is the number of entity recognitions that agree with the gold standard and for which the mapping also agrees with the gold standard. Values of precision, recall, and F-measure are presented.
| Assessment | Method | TP | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| Exact matching | Dictionary | 4,530 | 24.25 | 46.72 | 31.93 |
| | Machine learning | 4,783 | 44.79 | 49.33 | 46.95 |
| Left matching | Dictionary | 4,559 | 24.40 | 47.02 | 32.13 |
| | Machine learning | 4,972 | 46.56 | 51.28 | 48.81 |
| Right matching | Dictionary | 4,592 | 24.58 | 47.36 | 32.36 |
| | Machine learning | 4,885 | 45.75 | 50.38 | 47.95 |
| Left/right matching | Dictionary | 4,621 | 24.73 | 47.67 | 32.57 |
| | Machine learning | 5,074 | 47.52 | 52.33 | 49.81 |
| Partial matching | Dictionary | 5,185 | 27.75 | 53.48 | 36.54 |
| | Machine learning | 5,202 | 48.72 | 53.65 | 51.07 |
Evaluation of entity resolution on the subset of the gold standard composed of 9,696 chemical entities that have a mapping to ChEBI. Results of entity resolution for each assessment and method are shown in this table. Only the entities successfully recognized by both methods were considered for this evaluation. For the exact matching assessment, the number of entities successfully recognized by both methods was 3,668; for the left, right, left/right, and partial matching assessments, that number was 4,022, 4,082, 4,455, and 5,286 entities, respectively. True positives (TP) is the number of those entities for which the resolution was correct, that is, for which the mapping agrees with the gold standard. Values of precision, recall, and F-measure are presented.
| Assessment | Method | TP | Precision | Recall | F-measure |
|---|---|---|---|---|---|
| Exact matching | Dictionary | 3,079 | 83.94 | 31.76 | 46.08 |
| | Machine learning | 3,206 | 87.40 | 33.07 | 47.98 |
| Left matching | Dictionary | 3,215 | 79.94 | 33.16 | 46.87 |
| | Machine learning | 3,381 | 84.06 | 34.87 | 49.29 |
| Right matching | Dictionary | 3,191 | 78.17 | 32.91 | 46.32 |
| | Machine learning | 3,467 | 84.93 | 35.76 | 50.33 |
| Left/right matching | Dictionary | 3,327 | 74.68 | 34.31 | 47.02 |
| | Machine learning | 3,650 | 81.93 | 37.64 | 51.59 |
| Partial matching | Dictionary | 3,861 | 73.04 | 39.82 | 51.54 |
| | Machine learning | 4,273 | 80.84 | 44.07 | 57.04 |
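The abstract describes the resolution step as lexical-similarity-based: a recognized mention is mapped to the ChEBI term it most resembles as a string. The paper's actual similarity function and lexicon are not reproduced here; this sketch stands in with Python's stdlib `difflib` ratio over a made-up three-entry lexicon (the `CHEBI:0000x` identifiers are placeholders, not real ChEBI entries):

```python
from difflib import SequenceMatcher

# Hypothetical miniature lexicon: ChEBI term -> identifier (placeholder IDs).
LEXICON = {
    "zinc oxide": "CHEBI:00001",
    "zinc molybdate": "CHEBI:00002",
    "hyaluronic acid": "CHEBI:00003",
}

def resolve(mention: str) -> tuple[str, str, float]:
    """Map a recognized mention to the lexically most similar lexicon term."""
    def score(term: str) -> float:
        return SequenceMatcher(None, mention.lower(), term.lower()).ratio()
    best = max(LEXICON, key=score)
    return best, LEXICON[best], score(best)

# A surface variant still resolves to the closest term.
term, chebi_id, sim = resolve("zinc oxides")
print(term, chebi_id, round(sim, 2))  # → zinc oxide CHEBI:00001 0.95
```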
Figure 2. Example of Whatizit. The first line shows a small example of an input to Whatizit. The second line shows the output, in which the identified entities are marked and mapped to ChEBI identifiers.
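A dictionary-based tagger in the spirit of the Whatizit output shown in Figure 2 can be sketched as a longest-match-first lookup that wraps each hit with its identifier. This is an illustrative toy, not Whatizit's actual markup or pipeline, and the dictionary entries and identifiers below are placeholders:

```python
import re

# Hypothetical miniature dictionary: term -> placeholder identifier.
DICTIONARY = {"zinc oxide": "CHEBI:00001", "colostrum": "CHEBI:00002"}

def tag(text: str) -> str:
    """Mark every dictionary term in the text with its identifier."""
    # Longest terms first, so multi-word entries win over their sub-terms.
    for term in sorted(DICTIONARY, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern,
                      lambda m: f"[{m.group(0)}|{DICTIONARY[term]}]",
                      text, flags=re.IGNORECASE)
    return text

print(tag("compositions containing colostrum and zinc oxide"))
# → compositions containing [colostrum|CHEBI:00002] and [zinc oxide|CHEBI:00001]
```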
Example of a sequence of features and the corresponding label (Tag).
| Token | Stem | Prefix | Suffix | Number | Tag |
|---|---|---|---|---|---|
| cosmetic | cosmet | cos | tic | No | NO |
| compositions | composit | com | ons | No | NO |
| containing | contain | con | ing | No | NO |
| colostrum | colostrum | col | rum | No | NO |
| tocopherols | tocopherol | toc | ols | No | NE |
| zinc | zinc | zin | inc | No | S-NE |
| oxide | oxid | oxi | ide | No | E-NE |
| and | and | and | and | No | NO |
| hyaluronic | hyaluron | hya | nic | No | S-NE |
| acid | acid | aci | cid | No | E-NE |
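The feature columns in the table above (stem, three-character prefix and suffix, number flag) can be derived per token roughly as follows. The stemmer behind the Stem column is not specified here, so a crude suffix-stripping rule stands in for it; it is a placeholder that happens to reproduce the stems in this example, not the actual stemmer:

```python
def features(token: str) -> dict:
    """Surface features like those in the table above."""
    stem = token
    # Placeholder stemming rules only, not the paper's actual stemmer.
    for suffix in ("ions", "ing", "s", "e", "ic"):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
            break
    return {
        "token": token,
        "stem": stem,
        "prefix": token[:3],                       # first three characters
        "suffix": token[-3:],                      # last three characters
        "number": any(c.isdigit() for c in token), # token contains a digit?
    }

print(features("containing"))
# → {'token': 'containing', 'stem': 'contain', 'prefix': 'con', 'suffix': 'ing', 'number': False}
```

The resulting feature dictionaries, one per token, are the kind of input a sequence labeller consumes when predicting the Tag column.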