Daniel Albright, Arrick Lanfranchi, Anwen Fredriksen, William F Styler, Colin Warner, Jena D Hwang, Jinho D Choi, Dmitriy Dligach, Rodney D Nielsen, James Martin, Wayne Ward, Martha Palmer, Guergana K Savova.
Abstract
OBJECTIVE: To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components.
Keywords: Gold Standard Annotations; Natural Language Processing; Propbank; Treebank; UMLS; cTAKES
Year: 2013 PMID: 23355458 PMCID: PMC3756257 DOI: 10.1136/amiajnl-2012-001317
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Example text from a clinical note with Treebank, PropBank and UMLS annotations.
Figure 2. Example of clinical Treebank guideline changes.
Sample sentence and PropBank frame and roles
| Argument role | Predicate argument | Example: frame for ‘decrease’ | Example: ‘Dr Brown decreased the dosage of Mr Green's medications by 20 mg, from 50 mg to 30 mg, in order to reduce his nausea’ |
|---|---|---|---|
| Arg0 | Agent | Causer of decline, agent | Dr Brown |
| Arg1 | Patient | Thing decreasing | The dosage of Mr Green's medications |
| Arg2 | Instrument, benefactive, or attribute | Amount decreased by; extent or manner | By 20 mg |
| Arg3 | Starting point, benefactive, or attribute | Start point | From 50 mg |
| Arg4 | Ending point | End point | To 30 mg |
| ArgM | Modifier | | ArgM-PRP (purpose): in order to reduce his nausea |
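
The role assignments in the table above can be held in a small data structure. The following is a minimal sketch for illustration only: the `Predicate` class and its field names are hypothetical and do not reflect the corpus file format or any cTAKES API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Predicate:
    """Hypothetical container for one predicate and its labeled arguments."""
    lemma: str
    args: Dict[str, str] = field(default_factory=dict)

    def core_args(self) -> Dict[str, str]:
        # Numbered arguments (Arg0..Arg4) are core; ArgM-* labels are modifiers.
        return {k: v for k, v in self.args.items() if not k.startswith("ArgM")}


# Roles copied from the example sentence in the table.
decrease = Predicate(
    lemma="decrease",
    args={
        "Arg0": "Dr Brown",
        "Arg1": "the dosage of Mr Green's medications",
        "Arg2": "by 20 mg",
        "Arg3": "from 50 mg",
        "Arg4": "to 30 mg",
        "ArgM-PRP": "in order to reduce his nausea",
    },
)

for role, text in decrease.core_args().items():
    print(role, "->", text)
```

Separating core arguments from modifiers in this way mirrors the distinction between the numbered roles, which are specific to the frame, and ArgM, which is not.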
Frequency of annotations
| Annotation type | Raw annotation counts | In % |
|---|---|---|
| PropBank argument: Arg0 | 9647 | 48.47 |
| PropBank argument: Arg1 | 2901 | 14.58 |
| PropBank argument: Arg2 | 146 | 0.73 |
| PropBank argument: Arg3 | 109 | 0.55 |
| PropBank argument: Arg4 | 0 | 0.00 |
| PropBank argument: ArgM | 7098 | 35.67 |
| UMLS semantic group: Procedures | 4483 | 15.71 |
| UMLS semantic group: Disorders | 4208 | 14.74 |
| UMLS semantic group: Concepts and Ideas | 4308 | 15.10 |
| UMLS semantic group: Anatomy | 3652 | 12.80 |
| UMLS semantic type: Sign or Symptom | 3556 | 12.46 |
| UMLS semantic group: Chemicals and Drugs | 2137 | 7.49 |
| UMLS semantic group: Physiology | 1669 | 5.85 |
| UMLS semantic group: Activities and Behaviors | 990 | 3.47 |
| UMLS semantic group: Phenomena | 847 | 2.97 |
| UMLS semantic group: Devices | 282 | 0.99 |
| UMLS semantic group: Living Beings | 120 | 0.42 |
| UMLS semantic group: Objects | 103 | 0.36 |
| UMLS semantic group: Geographic Areas | 84 | 0.29 |
| UMLS semantic group: Organizations | 60 | 0.21 |
| UMLS semantic group: Occupations | 24 | 0.08 |
| UMLS semantic group: Genes and Molecular Sequences | 1 | 0.00 |
| Non-UMLS semantic category: Person | 2015 | 7.06 |
UMLS, Unified Medical Language System.
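
The percentage column appears to be each raw count divided by the total for its own annotation layer, so that PropBank arguments and UMLS entities are normalized separately. A quick check under that assumption, using the PropBank counts from the table (the function name `percentages` is just an illustrative helper):

```python
# Raw PropBank argument counts copied from the frequency table.
propbank_counts = {
    "Arg0": 9647,
    "Arg1": 2901,
    "Arg2": 146,
    "Arg3": 109,
    "Arg4": 0,
    "ArgM": 7098,
}


def percentages(counts):
    """Return each count as a percentage of the layer total, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


shares = percentages(propbank_counts)
for label, pct in shares.items():
    print(f"{label}: {pct:.2f}")
```

Under this reading the recomputed shares agree with the published PropBank column.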
Inter-annotator agreement results (F1 measure)
| Annotation layer | Average IAA |
|---|---|
| Treebank | 0.926 |
| PropBank, exact | 0.891 |
| PropBank, Core-arg | 0.917 |
| PropBank, Constituent | 0.931 |
| UMLS, exact | 0.697 |
| UMLS, partial | 0.750 |
IAA, inter-annotator agreement; UMLS, Unified Medical Language System.
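
When agreement is reported as F1, one annotator's spans are commonly treated as the reference and the other's as the prediction. The sketch below is a generic illustration with made-up spans, not the study's scoring procedure; the `(start, end, label)` tuple layout and the overlap rule used for partial matching are assumptions.

```python
def f1_agreement(ann_a, ann_b, partial=False):
    """F1 between two annotators' lists of (start, end, label) spans."""

    def match(x, y):
        if x[2] != y[2]:
            return False
        if partial:
            # Half-open character spans overlap if each starts before the other ends.
            return x[0] < y[1] and y[0] < x[1]
        return x[:2] == y[:2]

    if not ann_a or not ann_b:
        return 0.0
    # Precision: share of B's spans with a match in A.
    precision = sum(any(match(a, b) for a in ann_a) for b in ann_b) / len(ann_b)
    # Recall: share of A's spans with a match in B.
    recall = sum(any(match(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations (invented for illustration).
a = [(0, 5, "Disorders"), (10, 18, "Procedures"), (20, 27, "Anatomy")]
b = [(0, 5, "Disorders"), (12, 18, "Procedures")]

print(f1_agreement(a, b))                # exact span match
print(f1_agreement(a, b, partial=True))  # overlapping spans count as matches
```

Partial matching can only raise or preserve the score relative to exact matching, which is consistent with the UMLS rows in the table.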
The distribution of the training data across the different corpora
| Part A | WSJ (901 K) | WSJ (147 K) | MiPACQ (147 K) | WSJ+MiPACQ (147 K+147 K) | WSJ+MiPACQ (901 K+147 K) |
|---|---|---|---|---|---|
| # Of sentences | 37015 | 6006 | 11435 | 17441 | 43021 |
| # Of word-tokens | 901673 | 147710 | 147698 | 295408 | 1049383 |
| # Of verb-predicates | 96159 | 15695 | 16776 | 32471 | 111854 |
CN, clinical notes; PA, pathology notes; WSJ, Wall Street Journal.
Evaluation on the MiPACQ corpus/SHARP corpus/THYME corpus
| Task | Metric | Test corpus | WSJ (901 K) | WSJ (147 K) | MiPACQ (147 K) | WSJ+MiPACQ (147 K+147 K) | WSJ+MiPACQ (901 K+147 K) |
|---|---|---|---|---|---|---|---|
| POS | ACC | MiPACQ | 88.62 | 87.79 | 94.28 | 94.39 | 94.11 |
| POS | ACC | SHARP | 81.71 | 81.38 | 90.13 | 89.17 | 87.59 |
| POS | ACC | THYME | 83.07 | 82.32 | 92.12 | 92.00 | 90.84 |
| DEP | UAS | MiPACQ | 78.34 | 75.59 | 85.72 | 85.30 | 85.40 |
| DEP | UAS | SHARP | 67.34 | 65.01 | 74.93 | 74.70 | 73.89 |
| DEP | UAS | THYME | 65.58 | 62.11 | 73.21 | 73.56 | 73.95 |
| DEP | LAS | MiPACQ | 74.37 | 70.40 | 83.63 | 83.23 | 83.31 |
| DEP | LAS | SHARP | 62.63 | 59.09 | 72.19 | 71.80 | 70.35 |
| DEP | LAS | THYME | 60.23 | 56.33 | 70.26 | 70.76 | 70.96 |
| SRL | | MiPACQ | 76.98 | 74.57 | 86.58 | 87.31 | 88.17 |
| SRL | | SHARP | 74.29 | 71.57 | 80.86 | 82.86 | 82.66 |
| SRL | | THYME | 74.16 | 72.17 | 86.20 | 86.69 | 86.29 |
| SRL | | MiPACQ | 67.63 | 63.44 | 77.72 | 79.35 | 79.91 |
| SRL | | SHARP | 62.16 | 57.03 | 69.43 | 71.64 | 72.00 |
| SRL | | THYME | 63.28 | 58.32 | 76.69 | 77.46 | 78.38 |
ACC, accuracy; DEP, dependency parsing; LAS, labeled attachment scores; POS, part-of-speech; SRL, semantic role labeler; UAS, unlabeled attachment scores; WSJ, Wall Street Journal.
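
Unlabeled and labeled attachment scores differ only in whether the dependency label must match in addition to the head. A minimal sketch with toy data follows; the `(head_index, label)` token representation and the label names are assumptions made for illustration, not the evaluation code behind the table.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for parallel lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the predicted head is correct, regardless of label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label are correct.
    both_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100 * head_hits / n, 100 * both_hits / n


# Toy four-token sentence (invented): one wrong head, one wrong label.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "iobj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Because every token counted for LAS is also counted for UAS, LAS can never exceed UAS, which matches the pattern in the DEP rows.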