Abbas Akkasi, Ekrem Varoğlu, Nazife Dimililer.
Abstract
Named Entity Recognition (NER) from text constitutes the first step in many text mining applications. The most important preliminary step for NER systems using machine learning approaches is tokenization, where raw text is segmented into tokens. This study proposes an enhanced rule-based tokenizer, ChemTok, which utilizes rules extracted mainly from the training data set. The main novelty of ChemTok is the use of the extracted rules to merge tokens that were split in the previous steps, thus producing longer and more discriminative tokens. ChemTok is compared to the tokenization methods utilized by ChemSpot and tmChem. Support Vector Machines and Conditional Random Fields are employed as the learning algorithms. The experimental results show that the classifiers trained on the output of ChemTok outperform all classifiers trained on the output of the other two tokenizers in terms of both classification performance and the number of incorrectly segmented entities.
Year: 2016 PMID: 26942193 PMCID: PMC4749772 DOI: 10.1155/2016/4248026
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1: ChemTok Algorithm.
Rules used in Step 3 of the algorithm.
| Rule number | Rule explanation | Example: tokens after Step 2 | Example: merged token |
|---|---|---|---|
| 1 | Numeric tokens which are separated by “.” or “,” or “/” or “-” or “_” are integrated into a single token. | 125 | 125,12,12 |
| 2 | If concatenated tokens from Rule 1 are surrounded by balanced containers such as parentheses, braces, and brackets, both container tokens are conjoined into the token. | ( | (1-3) |
| 3 | Single uppercase tokens which are followed by a sequence of lowercase letters as the next token are recombined into a single token. | C | Common |
| 4 | If the concatenation of consecutive tokens is found in the list of known chemical names, they are merged into one token. | Na | NaCL |
| 5 | Apply the plurality rule to the tokens. | Acids | Acids |
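
The merging idea behind Rules 1 and 3 can be illustrated with a short Python sketch. This is not the authors' implementation: the function names `merge_numeric` and `merge_capitalized`, the regular expression, and the assumption that the input is a flat list of already-split tokens are all introduced here for illustration only.

```python
import re

# Assumption: a "numeric token" is a run of digits only.
NUMERIC = re.compile(r"^\d+$")
SEPARATORS = {".", ",", "/", "-", "_"}


def merge_numeric(tokens):
    """Rule 1 as read from the table: join numeric tokens separated by . , / - _"""
    merged = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if NUMERIC.match(current):
            # Keep absorbing "<separator> <number>" pairs that follow.
            while (i + 2 < len(tokens)
                   and tokens[i + 1] in SEPARATORS
                   and NUMERIC.match(tokens[i + 2])):
                current += tokens[i + 1] + tokens[i + 2]
                i += 2
        merged.append(current)
        i += 1
    return merged


def merge_capitalized(tokens):
    """Rule 3 as read from the table: a single uppercase letter followed by
    an all-lowercase token is recombined into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if (len(token) == 1 and token.isupper()
                and next_token.isalpha() and next_token.islower()):
            merged.append(token + next_token)
            i += 2
        else:
            merged.append(token)
            i += 1
    return merged


print(merge_numeric(["125", ",", "12", ",", "12"]))  # ['125,12,12']
print(merge_capitalized(["C", "ommon"]))             # ['Common']
```

Rules 2, 4, and 5 would need a bracket matcher, a list of known chemical names, and a plurality rule respectively, none of which are specified in enough detail here to sketch.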
Details of BioCreative data set.
| Data set | Number of abstracts | Number of sentences | NEs: Systematic | NEs: Abbreviation | NEs: Family | NEs: Formula | NEs: Identifier | NEs: Multiple | NEs: Trivial | Total number of NEs |
|---|---|---|---|---|---|---|---|---|---|---|
| Train | 3500 | 30418 | 6656 | 4538 | 4090 | 4448 | 672 | 202 | 8832 | 29438 |
| Development | 3500 | 30445 | 6816 | 4521 | 4223 | 4137 | 639 | 188 | 8970 | 29494 |
| Test | 3000 | 8655 | 5666 | 4059 | 3622 | 3443 | 513 | 199 | 7808 | 25310 |
Details of DDI corpus.
| Data set | Corpus | Number of documents | Number of sentences | NEs: Drug | NEs: Group | NEs: Brand | NEs: Drug_n | Total number of NEs |
|---|---|---|---|---|---|---|---|---|
| Train | DrugBank | 572 | 5675 | 8197 | 3206 | 1423 | 103 | 12929 |
| Train | Medline | 142 | 1301 | 1228 | 193 | 14 | 401 | 1836 |
| Test | DrugBank | 54 | 145 | 180 | 65 | 53 | 5 | 303 |
| Test | Medline | 58 | 520 | 171 | 90 | 6 | 115 | 382 |
Comparison of number of tokens, average token length, and number of incorrectly segmented entities for various tokenizers.
| Corpus | Data set | ChemSpot NT | ChemSpot ATL | ChemSpot NISE | tmVar NT | tmVar ATL | tmVar NISE | ChemTok NT | ChemTok ATL | ChemTok NISE | White space NT | White space ATL | White space NISE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHEMDNER | Train | 907405 | 4.62 | 40 | 965056 | 4.35 | 11 | 899343 | 4.66 | | 718244 | 5.84 | 9189 |
| CHEMDNER | Development | 901610 | 4.64 | 36 | 958475 | 4.36 | 11 | 893180 | 4.68 | | 714287 | 5.85 | 9174 |
| CHEMDNER | Test | 779700 | 4.63 | 8 | 828001 | 4.36 | | 772847 | 4.67 | | 513630 | 5.85 | 7804 |
| DrugBank | Train | 127435 | 5.06 | 50 | 135625 | 4.76 | 48 | 126753 | 5.09 | | 107409 | 6.00 | 4623 |
| DrugBank | Test | 3189 | 5.12 | 1 | 3407 | 4.79 | 1 | 3174 | 5.14 | | 2665 | 6.12 | 116 |
| Medline | Train | 32625 | 4.77 | 2 | 34178 | 4.55 | 2 | 32259 | 4.82 | | 27066 | 5.75 | 431 |
| Medline | Test | 12978 | 4.85 | | 13673 | 4.61 | | 12875 | 4.89 | | 10839 | 5.11 | 96 |
NT: number of tokens, ATL: average token length, and NISE: number of incorrectly segmented entities.
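
As a rough illustration of how these three quantities could be computed, the following Python sketch counts tokens, averages their lengths, and flags entities whose boundaries fall inside a token. The function names and the exact NISE criterion (an entity is incorrectly segmented when its start or end offset does not coincide with a token boundary) are assumptions made for this example; the record does not state the precise definition used in the study.

```python
def token_stats(tokens):
    """Return (NT, ATL): the number of tokens and their average length."""
    nt = len(tokens)
    atl = sum(len(t) for t in tokens) / nt if nt else 0.0
    return nt, atl


def count_incorrectly_segmented(token_spans, entity_spans):
    """Count entities whose start or end offset is not a token boundary.

    Both arguments are lists of (start, end) character offsets.
    This boundary criterion is an assumed definition of NISE.
    """
    starts = {start for start, _ in token_spans}
    ends = {end for _, end in token_spans}
    return sum(
        1 for start, end in entity_spans
        if start not in starts or end not in ends
    )


# "NaCl-induced": the entity "NaCl" covers characters 0..4.
entity = [(0, 4)]
whitespace_tokens = [(0, 12)]               # one token, entity buried inside
finer_tokens = [(0, 4), (4, 5), (5, 12)]    # "NaCl", "-", "induced"

print(token_stats(["NF", "-", "Kappa", "B"]))                  # (4, 2.25)
print(count_incorrectly_segmented(whitespace_tokens, entity))  # 1
print(count_incorrectly_segmented(finer_tokens, entity))       # 0
```

The toy example mirrors the pattern in the table: a white space tokenizer yields fewer, longer tokens but leaves many entity boundaries inside tokens.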
Features used for training classifiers.
| Feature set | Actual features in the feature set | Number of features used in set |
|---|---|---|
| Space features | Has right space, has left space, and has both right and left space | 3 |
| Context words | One token before and one token after current token | 2 |
| n-gram affixes | n-gram affixes (prefixes + suffixes) for | 8 |
| Word shapes | Word shape (number of uppercase, lowercase letters, digits, punctuation, and Greeks), digital word shape (word shape in digital format), and summarized word shape (combination of two aforementioned features) | 3 |
| Orthographic features | All uppercase, has slash, has punctuation, has real number, starts with digit, starts with uppercase, has more than 2 uppercase letters | 7 |
| Token length | Number of characters in the token | 1 |
| Common chemical prefixes and suffixes | Contains chemical affixes from the list of chemical affixes in [ | 1 |
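
To make the orthographic and token length rows concrete, here is a minimal Python sketch that derives those features for a single token. The function name, the dictionary keys, and the regular expression for "real number" are assumptions chosen for illustration; the study's actual feature extraction code is not given here.

```python
import re
import string

# Assumption: a "real number" is digits, a decimal point, then digits.
REAL_NUMBER = re.compile(r"\d+\.\d+")


def orthographic_features(token):
    """Return the seven orthographic features plus token length for one token."""
    return {
        "all_uppercase": token.isupper(),
        "has_slash": "/" in token,
        "has_punctuation": any(ch in string.punctuation for ch in token),
        "has_real_number": bool(REAL_NUMBER.search(token)),
        "starts_with_digit": token[:1].isdigit(),
        "starts_with_uppercase": token[:1].isupper(),
        "more_than_2_uppercase": sum(ch.isupper() for ch in token) > 2,
        "token_length": len(token),
    }


features = orthographic_features("NaCL")
print(features["starts_with_uppercase"], features["token_length"])  # True 4
```

A dictionary per token like this can be passed to most CRF or SVM toolkits after vectorization; which toolkit the study used is not stated in this record.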
NER performance (F-score in %) of classifiers using BioCreative data set.
| Tokenizer | CRF: Development | CRF: Test | SVM: Development | SVM: Test |
|---|---|---|---|---|
| White space | 75.39 | 75.44 | 75.65 | 75.67 |
| ChemSpot | 78.46 | 78.89 | 83.26 | 82.88 |
| tmVar | 76.15 | 76.50 | 82.29 | 82.27 |
| ChemTok | | | | |
NER performance (F-score in %) of classifiers using DrugBank and Medline corpora of DDI SemEval data set.
| Data set | Tokenizer | CRF | SVM |
|---|---|---|---|
| DrugBank | White space | 77.89 | 82.85 |
| DrugBank | ChemSpot | 87.16 | 89.10 |
| DrugBank | tmVar | 84.74 | 90.34 |
| DrugBank | ChemTok | | |
| Medline | White space | 51.51 | 42.41 |
| Medline | ChemSpot | 62.72 | 67.48 |
| Medline | tmVar | 62.04 | 67.50 |
| Medline | ChemTok | | |
Class based performance (F-score in %) for BioCreative corpus using various tokenizers.
| Algorithm | Entity type | Development: ChemSpot | Development: tmVar | Development: ChemTok | Test: ChemSpot | Test: tmVar | Test: ChemTok |
|---|---|---|---|---|---|---|---|
| CRF | Abbreviation | 68.14 | 66.58 | | 67.20 | 65.42 | |
| CRF | Family | 69.22 | 67.59 | | 71.94 | 70.60 | |
| CRF | Formula | 76.57 | 69.70 | | 75.29 | 69.81 | |
| CRF | Identifier | 63.03 | 59.45 | | 63.88 | 61.55 | |
| CRF | Multiple | 32.50 | 26.86 | | 32.77 | 30.50 | |
| CRF | Systematic | 79.41 | 78.09 | | 79.95 | 78.33 | |
| CRF | Trivial | 85.62 | 84.11 | | 85.52 | 83.69 | |
| SVM | Abbreviation | 72.59 | 72.12 | | 72.42 | 71.65 | |
| SVM | Family | 69.82 | 69.69 | | 71.81 | 71.57 | |
| SVM | Formula | 82.667 | 81.61 | | 82.15 | 81.68 | |
| SVM | Identifier | 72.08 | 69.76 | | 74.60 | 74.76 | |
| SVM | Multiple | 36.06 | 34.31 | | 26.89 | 20.96 | |
| SVM | Systematic | 82.33 | 81.51 | | 82.10 | 81.49 | |
| SVM | Trivial | 86.73 | 85.85 | | 86.50 | 86.14 | |
Class based performance (F-score in %) for SemEval DDI data set; DrugBank, Medline.
| Algorithm | Entity type | DrugBank: ChemSpot | DrugBank: tmVar | DrugBank: ChemTok | Medline: ChemSpot | Medline: tmVar | Medline: ChemTok |
|---|---|---|---|---|---|---|---|
| CRF | Group | 76.33 | 72.86 | | 62.41 | 59.25 | |
| CRF | Drug_n | 0.0 | 0.0 | | 10.44 | | 12.48 |
| CRF | Brand | 86.31 | 80.85 | | 0.0 | 0.0 | 0.0 |
| CRF | Drug | 89.77 | 86.85 | | 74.57 | 74.22 | |
| SVM | Group | 83.82 | 83.58 | | 46.28 | 44.06 | |
| SVM | Drug_n | 0.0 | 0.0 | 0.0 | 10.93 | 11.02 | |
| SVM | Brand | 92.15 | | 93.45 | 0.0 | 0.0 | 0.0 |
| SVM | Drug | 91.66 | 89.32 | | 68.04 | 67.06 | |
| Token | Label |
|---|---|
| inhibition | O |
| of | O |
| NF | B-CHEMICAL |
| - | I-CHEMICAL |
| Kappa | I-CHEMICAL |
| B | I-CHEMICAL |
| activation | O |
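
The labelling shown above follows the common B/I/O scheme: the first token of an entity receives a `B-` label, later tokens of the same entity receive `I-`, and everything else receives `O`. A minimal Python sketch of that assignment is given below. The helper names `locate` and `bio_tags`, and the use of character offsets to describe the entity, are assumptions made for this example rather than details taken from the study.

```python
def locate(text, tokens):
    """Find the (start, end) character offsets of each token, left to right."""
    spans, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)
        spans.append((start, start + len(token)))
        cursor = start + len(token)
    return spans


def bio_tags(token_spans, entity_spans, label="CHEMICAL"):
    """Assign B-/I-/O labels to tokens that lie fully inside an entity span."""
    tags = []
    for start, end in token_spans:
        tag = "O"
        for ent_start, ent_end in entity_spans:
            if start >= ent_start and end <= ent_end:
                prefix = "B-" if start == ent_start else "I-"
                tag = prefix + label
                break
        tags.append(tag)
    return tags


text = "inhibition of NF-Kappa B activation"
tokens = ["inhibition", "of", "NF", "-", "Kappa", "B", "activation"]
entity = [(14, 24)]  # character offsets of "NF-Kappa B"

for token, tag in zip(tokens, bio_tags(locate(text, tokens), entity)):
    print(token, tag)
```

Running this reproduces the labels in the table. A token that straddles an entity boundary would be labelled `O` by this sketch, which is one way an incorrectly segmented entity can hurt a classifier trained on token-level labels.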