Chunye Wang, Ramakrishna Akella.
Abstract
Mentions of disorders, such as diseases, syndromes, injuries, and abnormalities, provide crucial information about a patient's physical or mental condition. Identifying disorder mentions is one of the most significant steps in clinical text analysis. However, the same concept appears in clinical notes under many surface forms, and some mentions are recorded disjointedly, briefly, or intuitively. These difficulties challenge information extraction systems that focus on identifying explicit mentions. In this study, we propose a hybrid approach to disorder extraction that combines supervised machine learning, rule-based annotation, and an unsupervised NLP system. To identify different surface forms, we exploit rich features, especially semantic, syntactic, and sequential features, to better capture implicit relationships among words. We evaluated our method on the CLEF 2013 eHealth dataset. The experiments show that our hybrid approach achieves an F-score of 0.776 under strict evaluation, outperforming all participating systems in the Challenge.
Year: 2015 PMID: 26306265 PMCID: PMC4525272
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Distribution of disorder mentions in training and test sets
| | # Clinical Notes | # Total Mentions | # Unique Mentions | # Disjointed Mentions |
|---|---|---|---|---|
| Training Data | 200 | 5721 | 2344 | 645 |
| Test Data | 100 | 5234 | 2055 | 432 |
Unique mentions are those that appear in only one of the two data sets (training or test).
Figure 1. The framework of the hybrid extraction system
Description and examples of features for concept extraction
| Feature Type | Description | Examples |
|---|---|---|
| BOW | words of disorder mentions in training set | “dizziness”; “edema”; “facial” |
| Orthographic features | whether a word contains capital letters, digits, special characters, etc. | contain digit (“s3”), initial capital (“B-cell”), all capital (“MR”), contain hyphen (“T-wave”), CapsAndDigits (“DM2”) |
| Morphologic features | whether a word contains certain prefix or suffix | contain anti- (“antigen”, “anticoagulation”, “anti-inflammatory”); contain -ous (“granulomatous”, “edematous”, “erythematous”) |
| Part-of-Speech | Part-of-Speech tag of a word | “epigastric ventral hernia” [JJ_JJ_NN] |
| Sequential Features | label assigned to previous word | B, I, E, S, or O |
| Semantic Type Features | semantic categories of words (defined in ontology) | “pericardial effusion” [Disease or Syndrome]; “allergies” [Pathologic Function] |
| Semantically Related Term Features | whether a word belongs to semantically related words of a disorder concept in training data (obtained from parents and/or children nodes in ontology) | whether a word is one of [“cerebral”, “degeneration”, “dementia”, “senile”, “presenile”, “aphasia”], which are semantically related words of “Alzheimer”, a disorder concept in training data |
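The orthographic, morphologic, and sequential features in the table above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the flag names, and the prefix and suffix lists are hypothetical, and reading the labels B, I, E, S, O as Begin, Inside, End, Single, and Outside is an assumption based on common sequence-labeling practice. The sketch handles contiguous mentions only.

```python
# Hypothetical prefix/suffix lists, taken from the examples in the feature table.
PREFIXES = ("anti",)
SUFFIXES = ("ous",)


def orthographic_features(word):
    """Surface-shape flags for one token (flag names are illustrative)."""
    has_upper = any(c.isupper() for c in word)
    has_digit = any(c.isdigit() for c in word)
    return {
        "contains_digit": has_digit,
        "initial_capital": word[:1].isupper(),
        "all_capital": word.isalpha() and word.isupper(),
        "contains_hyphen": "-" in word,
        "caps_and_digits": has_upper and has_digit,
    }


def morphologic_features(word):
    """Flags for whether the token starts or ends with a listed affix."""
    lowered = word.lower()
    features = {}
    for prefix in PREFIXES:
        features["prefix_" + prefix] = lowered.startswith(prefix)
    for suffix in SUFFIXES:
        features["suffix_" + suffix] = lowered.endswith(suffix)
    return features


def bieso_tags(n_tokens, spans):
    """Label tokens given contiguous mention spans (start, end), end exclusive.

    Assumed scheme: B = begin, I = inside, E = end, S = single-token mention,
    O = outside any mention.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            tags[end - 1] = "E"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
    return tags


# "epigastric ventral hernia" as one three-token mention
print(bieso_tags(3, [(0, 3)]))  # ['B', 'I', 'E']
print(orthographic_features("DM2")["caps_and_digits"])  # True
```

When tagging left to right, the label assigned to the previous word would be passed as an extra feature for the current word, which is how the sequential feature in the table is described.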
Performance of extraction system with different feature and component settings
| SVM Features | Rules | MetaMap | P/R/F1 (strict) | ΔF1 (%) | P/R/F1 (relaxed) | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| BOW | Orth. | Mor. | POS | Seq. | SRT | ST | |||||
| + | 0.453/0.435/0.444 | – | 0.807/0.792/0.799 | ||||||||
| + | + | 0.462/0.476/0.469 | 0.025 (6%) | 0.817/0.813/0.815 | |||||||
| + | + | 0.465/0.473/0.469 | 0.025 (6%) | 0.814/0.806/0.810 | |||||||
| + | + | 0.480/0.507/0.493 | 0.049 (11%) | 0.799/0.816/0.807 | |||||||
| + | + | 0.500/0.486/0.493 | 0.049 (11%) | 0.859/0.823/0.841 | |||||||
| + | + | 0.466/0.481/0.473 | 0.029 (7%) | 0.809/0.813/0.811 | |||||||
| + | + | 0.611/0.556/0.582 | 0.138 (31%) | 0.876/0.790/0.831 | |||||||
| + | + | 0.571/0.611/0.590 | 0.146 (33%) | 0.825/0.865/0.845 | |||||||
| + | + | 0.506/0.531/0.518 | 0.074 (17%) | 0.815/0.834/0.824 | |||||||
| + | + | + | 0.463/0.472/0.467 | 0.023 (5%) | 0.820/0.810/0.815 | ||||||
| + | + | + | + | + | 0.636/0.575/0.604 | 0.160 (36%) | 0.887/0.791/0.836 | ||||
| + | + | + | + | + | + | + | 0.698/0.598/0.644 | 0.200 (45%) | 0.920/0.757/0.831 | ||
| + | + | + | + | + | + | + | + | 0.816/0.719/0.764 | 0.320 (72%) | 0.935/0.825/0.877 | |
| + | + | + | + | + | + | + | + | + | 0.816/0.740/0.776 | 0.332 (75%) | 0.929/0.844/0.884 |
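Orth. - orthographic feature; Mor. - morphologic feature; POS - part-of-speech feature; Seq. - sequential feature; SRT - semantically related term feature; ST - semantic type feature. ΔF1 is the absolute (relative) gain in strict F1 over the first row.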
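The F1 and ΔF1 columns follow from the reported precision and recall, which a short Python sketch can check. F1 is taken here as the usual harmonic mean of precision and recall, and ΔF1 as the gain over the first row's strict F1 of 0.444. The helper names `f1_score` and `delta_f1` are illustrative, not from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def delta_f1(f1, baseline_f1):
    """Return (absolute gain, relative gain) of f1 over a nonzero baseline."""
    gain = f1 - baseline_f1
    return gain, gain / baseline_f1


# First row (strict): P = 0.453, R = 0.435
baseline = round(f1_score(0.453, 0.435), 3)
print(baseline)  # 0.444

# Last row (strict): P = 0.816, R = 0.740
full = round(f1_score(0.816, 0.740), 3)
print(full)  # 0.776

gain, relative = delta_f1(full, baseline)
print(round(gain, 3), round(relative * 100))  # 0.332 75
```

The last row's precision and recall therefore yield the strict F-score of 0.776 stated in the abstract, and the gain of 0.332 (75%) agrees with the ΔF1 column.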