| Literature DB >> 21658288 |
Neil Barrett, Jens Weber-Jahnke.
Abstract
BACKGROUND: Tokenization is an important component of language processing yet there is no widely accepted tokenization method for English texts, including biomedical texts. Other than rule based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer's output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text.Entities:
Year: 2011 PMID: 21658288 PMCID: PMC3111587 DOI: 10.1186/1471-2105-12-S3-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Tokenizer components and information flow. A diagram illustrating the tokenizer's components and the flow of information through them.
Figure 2. A bounded lattice representing a sentence's segmentations. An example of a bounded lattice representing a sentence's segmentations.
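The lattice of Figure 2 can be read as the set of all ways to group a sentence's atomic segments into tokens, with each path through the lattice being one candidate segmentation. A minimal sketch, assuming the paths are enumerated recursively (the paper's adapted Viterbi algorithm searches this space efficiently rather than enumerating it):

```python
def segmentations(segments):
    """Enumerate every grouping of a list of atomic segments into
    tokens -- the paths through a segmentation lattice.  Each path is
    a list of tokens; each token concatenates adjacent segments."""
    if not segments:
        return [[]]
    paths = []
    for i in range(1, len(segments) + 1):
        head = "".join(segments[:i])          # one candidate token
        for rest in segmentations(segments[i:]):
            paths.append([head] + rest)
    return paths

# Four atomic segments admit 2^3 = 8 segmentations, one per choice of
# join/split at each of the three internal boundaries.
```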
Inter-segmentor agreement.
| Description | Percent Agreement | Cohen’s Kappa |
|---|---|---|
| Preliminary | 56.9 | 0.139 |
| Parentheses corrected | 94.4 | 0.888 |
| Final corrected | 95.8 | 0.916 |
Inter-segmentor agreement on SNOMED CT concept description segmentations.
Token classes derived from SNOMED CT concept descriptions.
| Class | Examples |
|---|---|
| Whitespace | |
| Independents | [ ? ) |
| Dash or Hyphen | ACHE - Acetylcholine |
| Alphabetic | Does or dental |
| Numeric | 1500 1.2 10,000 III 1/2 |
| Possessive | ’s |
| Substances | 2-chloroaniline |
| Serotypes | O128:NM |
| Abbreviations | L.H. O/E |
| Acronyms | DIY |
| Lists | Paracetamol + caffeine |
| Range | C1-4 |
| Functional names | H-987 |
Tokenizer results.
| Tokenizer | Accuracy (%) | 95% Confidence Interval (%) |
|---|---|---|
| Whitespace | 53.9 | 52.0, 55.8 |
| Specialist | 47.7 | 45.8, 49.6 |
| Medpost | 92.9 | 91.9, 93.9 |
| Adapted Viterbi, 0-order HMM | 70.8 | 69.1, 72.5 |
| Adapted Viterbi, 1st-order HMM (AV-1) | 84.6 | 83.3, 85.9 |
| AV-1 + random 10% of MedPost corpus | 92.4 (5-run avg) | 91.4, 93.4 |
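The HMM-based rows in the results table score token-class sequences with the Viterbi algorithm. The sketch below is textbook first-order Viterbi over a toy two-class model, not the paper's adaptation; the states, observations, and all probabilities are made-up illustrations.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state (token-class) sequence for an observation
    sequence under a first-order HMM, via dynamic programming."""
    V = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda x: x[1])
            V[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-class model (probabilities invented for illustration):
states = ["Alphabetic", "Numeric"]
log_start = {"Alphabetic": math.log(0.6), "Numeric": math.log(0.4)}
log_trans = {s: {"Alphabetic": math.log(0.5), "Numeric": math.log(0.5)}
             for s in states}
log_emit = {"Alphabetic": {"dose": math.log(0.9), "1500": math.log(0.1)},
            "Numeric":    {"dose": math.log(0.1), "1500": math.log(0.9)}}

best = viterbi(["dose", "1500"], states, log_start, log_trans, log_emit)
```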