Tao Chen, Richard M Cullen, Marshall Godwin.
Abstract
For the 2014 i2b2/UTHealth de-identification challenge, we introduced a new non-parametric Bayesian hidden Markov model using a Dirichlet process (HMM-DP). The model is intended to reduce task-specific feature engineering and to generalize well to new data. For the challenge, we developed a variational method to learn the model and an efficient approximation algorithm for prediction. To accommodate out-of-vocabulary words, we designed a number of feature functions to model such words. The results show that the model can use local context cues to make correct predictions without manual feature engineering, and that it performs as accurately as state-of-the-art conditional random field models in a number of categories. To incorporate long-range and cross-document context cues, we developed a skip-chain conditional random field model to align the results produced by HMM-DP, which further improved performance.
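The abstract does not list the feature functions used for out-of-vocabulary words. As a purely illustrative sketch, the Python below computes a few surface features commonly used for unseen tokens in de-identification work (word shape, capitalization, digits, punctuation, suffix). The function name `oov_features` and every feature in it are assumptions made for illustration, not the authors' actual feature set.

```python
import re


def oov_features(token: str) -> dict:
    """Hypothetical surface features for an out-of-vocabulary token.

    Assumption: these features are illustrative only; the source does not
    specify which feature functions the HMM-DP model uses.
    """
    # Map each character class to a symbol: A = upper, a = lower, 9 = digit.
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "9", shape)
    # Squeeze repeated symbols so tokens of different length share a shape.
    short_shape = re.sub(r"(.)\1+", r"\1", shape)
    return {
        "shape": short_shape,
        "is_capitalized": token[:1].isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_punct": any(not c.isalnum() for c in token),
        "suffix3": token[-3:].lower(),
    }


if __name__ == "__main__":
    for tok in ["McAllister", "03/14/2078"]:
        print(tok, oov_features(tok))
```

Features of this kind let a model assign probability to a name or date it has never seen, because the unseen token still shares a shape or suffix with tokens observed in training.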
Keywords: De-identification; Dirichlet process; Hidden Markov model; Natural language processing; Variational method
Year: 2015 PMID: 26407642 PMCID: PMC4984397 DOI: 10.1016/j.jbi.2015.09.004
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317