Abstract
In clinical text mining, representing medical terminologies and n-gram terms in sparse medical reports, whether by supervised or unsupervised methods, remains one of the biggest challenges. To address this issue, we propose a novel method for word and n-gram representation at the semantic level. We first represent each word by its distances to a set of reference features, calculated by a reference distance estimator (RDE) learned from labeled and unlabeled data, and then generate new features using simple techniques of discretization, random sampling, and merging. The new features are a set of binary rules that can be interpreted as semantic tags derived from words and n-grams. We show that the new features significantly outperform classical bag-of-words and n-gram features in the task of heart disease risk factor extraction in the i2b2 2014 challenge. It is promising that semantic tags can replace the original text entirely with even better prediction performance, and that they can derive new rules beyond the lexical level.
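The pipeline described above can be illustrated with a minimal sketch. The sampling-and-merging step, as stated in the abstract, draws random subsets of a word's discretized (reference feature, distance) pairs and merges them into candidate binary rules. All names, sizes, and the sampling scheme below are illustrative assumptions, not the authors' implementation:

```python
import random

def sample_rules(word_repr, n_rules=3, subset_size=2, seed=0):
    """Sample candidate conjunction rules from a word representation.

    word_repr: dict mapping reference feature -> discretized distance bin.
    Returns a set of rules, each a frozenset of (feature, bin) pairs;
    a rule fires for a word only if all its pairs match.
    """
    rng = random.Random(seed)
    pairs = list(word_repr.items())
    rules = set()
    while len(rules) < n_rules:
        # Draw a random subset of pairs and merge them into one conjunction.
        rules.add(frozenset(rng.sample(pairs, subset_size)))
    return rules

# Hypothetical word representation over four reference features.
rules = sample_rules({"insulin": 1, "bid": 1, "captopril": 0, "zetia": 0})
for r in sorted(map(sorted, rules)):
    print("&".join(f"({f},{b})" for f, b in r))
```

In the paper's notation each printed conjunction corresponds to one semantic tag; duplicate draws are discarded by the set, so the loop runs until `n_rules` distinct rules exist.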
Year: 2015 PMID: 26306286 PMCID: PMC4525271
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. An example of the RDSM algorithm for semantic tag generation. The reference features are for the task of identifying sentences related to medications in i2b2 2014. Features in red are the final features for classification.
Lexical vs. semantic representation
| Features | Precision / Recall / F1 / AUC (Macro, Sentence-level) | Precision / Recall / F1 (Micro, i2b2 2014) |
|---|---|---|
| Words | 0.6815 / 0.7105 / 0.6717 / 0.9572 | 0.8546 / 0.8276 / 0.8409 |
| Semantic tags from words | | |
| Bigrams | 0.6056 / 0.5702 / 0.5579 / 0.8814 | 0.7989 / 0.7368 / 0.7666 |
| Semantic tags from bigrams | 0.6168 / 0.6452 / 0.6144 / 0.9593 | 0.8170 / 0.7838 / 0.8001 |
| Trigrams | 0.5227 / 0.4164 / 0.4452 / 0.7792 | 0.7222 / 0.6183 / 0.6662 |
| Semantic tags from trigrams | 0.5936 / 0.6018 / 0.5831 / 0.9504 | 0.7550 / 0.7509 / 0.7530 |
Figure 2. Relation between RDSM parameters and prediction performance. Macro F1 over 39 binary classification tasks was used for evaluation. Semantic features for words were applied.
Examples of indicative rules discovered from words and semantic tags. The numbers in {} are F1 scores on the training and testing data, respectively, obtained by the individual rule. Each semantic tag is a conjunction of several pairs in the format (reference feature, distance). For distance, 0 denotes (−∞, 0], 1 denotes (0, 0.005], and 2 denotes (0.005, +∞) (see Method section).
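The binning scheme in the caption is concrete enough to sketch directly: a real-valued RDE distance maps to bin 0 on (−∞, 0], bin 1 on (0, 0.005], and bin 2 on (0.005, +∞), and a semantic tag is the conjunction of (reference feature, bin) pairs. The function names and example distances below are hypothetical:

```python
def discretize(distance):
    """Map a real-valued RDE distance to the three bins from the caption."""
    if distance <= 0:
        return 0
    if distance <= 0.005:
        return 1
    return 2

def semantic_tag(distances):
    """Render a conjunction tag from {reference feature: raw distance} pairs,
    in the paper's (feature,bin)&(feature,bin)&... notation."""
    return "&".join(
        f"({feat},{discretize(d)})" for feat, d in distances.items()
    )

# Illustrative distances for three reference features.
tag = semantic_tag({"insulin": 0.003, "captopril": -0.1, "increase": 0.02})
print(tag)  # (insulin,1)&(captopril,0)&(increase,2)
```

Since Python 3.7 dicts preserve insertion order, so the rendered tag lists the pairs in the order the reference features were given.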
| Words | Semantic tags from words |
|---|---|
| mg {0.44, 0.38} | (insulin,1)&(bid,1)&(acetylsalicylic,1)&(s,2)&(captopril,0)&(increase,2)&(zetia,0)&(twice,0)&(blockade,1)&(x1,1) |
| cad {0.35, 0.35} | (collaterals,0)&(male,0)&(2144,0)&(dm,0)&(stented,1)&(ramus,0)&(which,1)&(2148,0)&(2080the,0)&(depresion,0) |