Abstract
BACKGROUND: Interest in identifying and extracting chemical entities from academic articles has grown considerably, and many approaches have been proposed to solve this problem. In this work we describe a probabilistic framework that allows the output of multiple information extraction systems to be combined in a systematic way. The identified entities are assigned a probability score that reflects the extractors' confidence, without the need for each individual extractor to generate a probability score. We quantitatively compared the performance of multiple chemical tokenizers to measure the effect of tokenization on extraction accuracy. We then built a single Conditional Random Fields (CRF) extractor that uses the best-performing tokenizer and a unique collection of features, such as word embeddings and Soundex codes, which, to the best of our knowledge, have not been explored in this context before.
Keywords: Chemical Entity Extraction; Conditional Random Field; Ensemble Learning; Information Extraction
Year: 2015 PMID: 25810769 PMCID: PMC4331688 DOI: 10.1186/1758-2946-7-S1-S12
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Accuracy of multiple tokenizers when tested on the chemical entities of the test set.
| Measure | ChemSpot | OSCAR4 | ChemXSeer |
|---|---|---|---|
| Correct | 17149 | 20491 | 17869 |
| Split Correct | 2379 | 1744 | 3190 |
| Total Correct | 19528 | 22235 | 21059 |
| Incorrect | 5823 | 3116 | 4292 |
| Accuracy Percentage | 77.03% | 87.7% | 83.06% |
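The totals and percentages in this table appear to follow from the raw counts, assuming that Total Correct is the sum of Correct and Split Correct and that accuracy is Total Correct divided by all tested entities (Total Correct plus Incorrect). Neither formula is stated in the source, so the Python sketch below is only a consistency check under those assumptions; the function name `tokenizer_accuracy` is illustrative.

```python
# Counts copied from the table above: (correct, split_correct, incorrect).
COUNTS = {
    "ChemSpot": (17149, 2379, 5823),
    "OSCAR4": (20491, 1744, 3116),
    "ChemXSeer": (17869, 3190, 4292),
}


def tokenizer_accuracy(correct, split_correct, incorrect):
    """Return (total_correct, accuracy in percent).

    Assumes accuracy = total_correct / (total_correct + incorrect);
    the source does not spell out the formula.
    """
    total_correct = correct + split_correct
    return total_correct, 100.0 * total_correct / (total_correct + incorrect)


for name, counts in COUNTS.items():
    total, pct = tokenizer_accuracy(*counts)
    print(f"{name}: total correct = {total}, accuracy = {pct:.2f}%")
```

Under these assumptions the computed values agree with the reported ones to within about 0.01 percentage points, which suggests the reported figures were rounded or truncated.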
Words most similar to "calcium", sorted by cosine similarity.
| Word | Cosine Similarity |
|---|---|
| Ca2 | 0.838966 |
| Ca | 0.692185 |
| Thapsigargin | 0.565048 |
| Stores | 0.562570 |
| Potassium | 0.549055 |
| Magnesium | 0.539387 |
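A ranking like the one above can be produced by computing the cosine similarity between the query word's embedding and every other word's embedding. The following Python sketch illustrates the calculation only; the three-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical and are not the embeddings used in the paper, so the printed scores will not match the table.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, top_n=3):
    """Rank every other word by cosine similarity to `query`, highest first."""
    target = embeddings[query]
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors, NOT the embeddings used in the paper.
TOY_EMBEDDINGS = {
    "calcium": [0.9, 0.1, 0.3],
    "ca2": [0.8, 0.2, 0.3],
    "potassium": [0.5, 0.6, 0.1],
    "stores": [0.1, 0.9, 0.2],
}

for word, score in most_similar("calcium", TOY_EMBEDDINGS):
    print(f"{word}: {score:.3f}")
```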
Example of word embedding clusters.
| Term | Cluster Id |
|---|---|
| Tetralinoleoyl | 8 |
| thiophosphocholine | 8 |
| Phosphoethanolamines | 8 |
| y505f | 10 |
| Vav | 10 |
| Tsad | 10 |
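One plausible way to use such clusters in a sequence labeler is to expose each token's cluster id as a categorical feature. The Python sketch below shows that idea with the rows from the table; how the paper actually encodes cluster membership is not stated here, and both the function name `cluster_feature` and the `cluster=NONE` placeholder are assumptions made for illustration.

```python
# Cluster ids copied from the table above; keys are lower-cased for lookup.
CLUSTER_IDS = {
    "tetralinoleoyl": 8,
    "thiophosphocholine": 8,
    "phosphoethanolamines": 8,
    "y505f": 10,
    "vav": 10,
    "tsad": 10,
}


def cluster_feature(token, cluster_ids=CLUSTER_IDS):
    """Return a categorical feature string for the token's embedding cluster.

    Tokens without a known cluster get the placeholder "cluster=NONE"
    (a hypothetical convention, not taken from the paper).
    """
    cluster = cluster_ids.get(token.lower())
    return f"cluster={cluster}" if cluster is not None else "cluster=NONE"


tokens = ["Vav", "binds", "thiophosphocholine"]
print([cluster_feature(t) for t in tokens])
```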
Soundex code for English letters.
| Soundex Code | Letters |
|---|---|
| 1 | B, F, P, V |
| 2 | C, G, J, K, Q, S, X, Z |
| 3 | D, T |
| 4 | L |
| 5 | M, N |
| 6 | R |
| No Code | A, E, I, O, U, H, W, Y |
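The mapping above is enough to sketch a Soundex encoder. The Python version below follows the standard American Soundex rules (keep the first letter, collapse adjacent letters with the same code, let H and W pass through without separating equal codes, pad with zeros); the source does not say which Soundex variant the paper uses, so these rules are an assumption. The `length` parameter is also an assumption, added so that codes of different lengths can be produced.

```python
# Letter-to-digit mapping copied from the table above.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
SOUNDEX_CODES = {
    letter: digit for digit, letters in _GROUPS.items() for letter in letters
}


def soundex(word, length=4):
    """Encode `word` as its first letter followed by digits, padded to `length`.

    Vowels, H, W and Y carry no code. A vowel between two letters with the
    same code separates them; H and W do not (standard American Soundex).
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = [letters[0]]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if len(encoded) == length:
            break
        digit = SOUNDEX_CODES.get(letter, "")
        if digit and digit != previous:
            encoded.append(digit)
        if letter not in "HW":
            # Vowels reset `previous` to "", so a repeated code after a vowel
            # is kept; H and W leave `previous` untouched.
            previous = digit
    return "".join(encoded).ljust(length, "0")


for example in ("calcium", "Robert", "Rupert"):
    print(example, soundex(example))
```

Words that sound alike, such as "Robert" and "Rupert", receive the same code, which is what makes the code usable as a coarse spelling-tolerant feature.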
Probability of a candidate entity conditioned on possible values of the indicator random variables for each of the three taggers used.
| ChemXSeer | OSCAR4 | ChemSpot | Probability Estimate on Dev | Probability Estimate on Train |
|---|---|---|---|---|
| 1 | 0 | 0 | 0.252 | 0.26159 |
| 0 | 1 | 0 | 0.089 | 0.08507 |
| 0 | 0 | 1 | 0.249 | 0.25588 |
| 1 | 1 | 0 | 0.82083 | 0.81755 |
| 1 | 0 | 1 | 0.72799 | 0.67361 |
| 0 | 1 | 1 | 0.55869 | 0.53267 |
| 1 | 1 | 1 | 0.93316 | 0.93386 |
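The table can be read as a lookup from the pattern of taggers that marked a candidate to the estimated probability that the candidate is a true entity. The Python sketch below applies the development-set estimates in that way and then filters candidates by a confidence threshold. The names `candidate_probability` and `accept` are illustrative, and accepting candidates whose probability is greater than or equal to the threshold is an assumption; the source does not state whether the comparison is strict.

```python
# Dev-set estimates copied from the table above.
# Key order: (ChemXSeer, OSCAR4, ChemSpot); 1 means the tagger marked the candidate.
DEV_PROBABILITY = {
    (1, 0, 0): 0.252,
    (0, 1, 0): 0.089,
    (0, 0, 1): 0.249,
    (1, 1, 0): 0.82083,
    (1, 0, 1): 0.72799,
    (0, 1, 1): 0.55869,
    (1, 1, 1): 0.93316,
}

TAGGER_ORDER = ("ChemXSeer", "OSCAR4", "ChemSpot")


def candidate_probability(taggers_that_fired, table=DEV_PROBABILITY):
    """Look up the probability for the set of taggers that marked a candidate."""
    unknown = set(taggers_that_fired) - set(TAGGER_ORDER)
    if unknown:
        raise ValueError(f"unknown tagger(s): {sorted(unknown)}")
    key = tuple(int(name in taggers_that_fired) for name in TAGGER_ORDER)
    if not any(key):
        raise ValueError("a candidate must be marked by at least one tagger")
    return table[key]


def accept(taggers_that_fired, threshold):
    """Keep a candidate if its probability reaches the threshold (assumed >=)."""
    return candidate_probability(taggers_that_fired) >= threshold


print(candidate_probability({"ChemXSeer", "OSCAR4"}))
print(accept({"OSCAR4"}, threshold=0.25))
```

A candidate found only by OSCAR4 scores 0.089 and is dropped at a threshold of 0.25, whereas one found by both ChemXSeer and OSCAR4 scores 0.82083 and survives thresholds up to 0.8.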
Performance of the ensemble extractor on the CEM task at various confidence thresholds.
| Dataset | Threshold | Precision | Recall | F-Measure |
|---|---|---|---|---|
| Dev | 0.01 | 0.31543 | 0.8924 | 0.46611 |
| Dev | 0.24 | 0.67406 | 0.73650 | 0.70390 |
| Dev | 0.25 | 0.70871 | 0.71598 | 0.71232 |
| Dev | 0.5 | 0.79486 | 0.67544 | 0.7303 |
| Dev | 0.7 | 0.87369 | 0.55663 | 0.6800 |
| Dev | 0.8 | 0.88315 | 0.52835 | 0.66116 |
| Dev | 0.9 | 0.93316 | 0.30973 | 0.46509 |
| Train | 0.01 | 0.30711 | 0.88147 | 0.45552 |
| Train | 0.25 | 0.66208 | 0.73126 | 0.69495 |
| Train | 0.26 | 0.78473 | 0.66680 | 0.72098 |
| Train | 0.5 | 0.78473 | 0.6668 | 0.72098 |
| Train | 0.6 | 0.86928 | 0.55312 | 0.67607 |
| Train | 0.7 | 0.88266 | 0.52568 | 0.65893 |
| Train | 0.9 | 0.93386 | 0.31135 | 0.467 |
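The F-Measure column is consistent with the balanced F1 score, the harmonic mean of precision and recall, $F = 2PR / (P + R)$. The Python sketch below recomputes it from the development-set rows and picks the threshold with the highest value; that the paper uses the balanced variant is inferred from the numbers rather than stated in the source.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) rows copied from the Dev part of the table above.
DEV_ROWS = [
    (0.01, 0.31543, 0.8924),
    (0.24, 0.67406, 0.73650),
    (0.25, 0.70871, 0.71598),
    (0.5, 0.79486, 0.67544),
    (0.7, 0.87369, 0.55663),
    (0.8, 0.88315, 0.52835),
    (0.9, 0.93316, 0.30973),
]

best_threshold, best_f = max(
    ((t, f_measure(p, r)) for t, p, r in DEV_ROWS),
    key=lambda pair: pair[1],
)
print(f"best Dev threshold: {best_threshold}, F = {best_f:.4f}")
```

Raising the threshold trades recall for precision, so the F-measure peaks at an intermediate threshold instead of at either extreme.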
Figure 1. Precision-Recall curves for CEM task on training dataset.
Figure 2. Precision-Recall curves for CEM task on development dataset.
Performance of ChemXSeer 2
| Run | Precision | Recall | F Score |
|---|---|---|---|
| ChemXSeer Tagger 2.0 | 0.89569 | 0.78009 | 0.83390 |
Performance of ChemXSeer 2 with different feature sets.
| Feature | Precision | Recall | F Score |
|---|---|---|---|
| All | 0.89730 | 0.77646 | 0.83252 |
| All - NLP | 0.89749 | 0.74151 | 0.81208 |
| All - Word2Vec | 0.88993 | 0.75393 | 0.81631 |
| All - Soundex | 0.89784 | 0.77342 | 0.83100 |
| Soundex 3 | 0.88937 | 0.76869 | 0.82464 |
| Soundex 5 | 0.89647 | 0.77606 | 0.83193 |
| Soundex 7 | 0.89730 | 0.77646 | 0.83252 |
| Soundex 100 | 0.88868 | 0.76995 | 0.82507 |