Aleksandar Savkov, John Carroll, Rob Koeling, Jackie Cassell.
Abstract
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult for existing natural language analysis tools to process, since they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
Keywords: Annotation guidelines; Chunking; Clinical text; Corpus annotation; Named entities
Year: 2016 PMID: 27570501 PMCID: PMC4983282 DOI: 10.1007/s10579-015-9330-7
Source DB: PubMed Journal: Lang Resour Eval ISSN: 1574-020X Impact factor: 1.358
Fig. 1 Patient record content diagram
Examples of examination records from the GPRD consisting of a structured entry (left) and a text note (right)
| Structured entry | Text note |
|---|---|
| Telephone encounter | tel from wife pt v scared re mri next wed- ok for small dose dz |
| Constipation NOS | 1 BM 3 days ago following 5 days without any. now no BM last 3 days either. breast fed baby ! o/e abd soft. no palpable faeces. try lactulose 2.5 ml bd |
| Cardiac failure therapy | Hxnsyx settled ? feels abit better OE creps R base only. jvp not seen. IMP better re fluid status, rate still ok. P cont w bloods 2/7, rv 1w |
| Had a chat to patient | re. cough at night; see letter from Mr ~~~~~ |
A non-exhaustive list of notable clinical corpora
| Corpus | Size | Document type | Annotation type |
|---|---|---|---|
| Harvey Corpus | 750 | GP notes | Syntactic chunks, four semantic annotation types |
| Uzuner et al. ( | 889 | Discharge summaries | De-identification, smoker status |
| Uzuner ( | 1237 | Discharge summaries | Present, absent, questionable for obesity + 15 comorbidities |
| Uzuner et al. ( | 1243 | Discharge summaries | Medications, dosages, frequencies, modes, reasons, durations, list/narrative |
| Uzuner et al. ( | 871 | Discharge summaries, progress reports | Concepts, assertions, relations |
| Sun et al. ( | 310 | Discharge summaries | Temporal relations |
| Roberts et al. ( | 565 k | Histopathology reports, clinical narratives, and imaging reports | Entities and relations |
| Pakhomov et al. ( | 271 | Clinical notes | POS |
| Ogren et al. ( | 160 | Outpatient notes | Concepts from a subset of SNOMED-CT |
| Voorhees and Hersh ( | 17 k | Patient visits consisting of history and physical reports, surgical pathology reports, radiology reports | Topics |
| Pestian et al. ( | 1954 | Radiology reports | ICD-9-CM codes |
| Fan et al. ( | 50 | Progress reports | POS |
| Fan et al. ( | 25 | Progress reports | Syntactic trees of ill-formed sentences |
Note that the size is reported in terms of number of documents
Fig. 2 Examples illustrating correct (line three) and incorrect (lines one and two) use of embedded annotations
Fig. 3 BRAT annotation showing labelled spans
Fig. 4 Two different annotations of the same text
Fig. 5 Inter-annotator agreement (IAA) during the training period
Fig. 6 (a) IAA for the nine annotation batches of the corpus, in the order they were annotated; (b) IAA of the annotation types across the whole corpus
IAA between annotators C and D on their training annotation batches
| | Strict precision | Strict recall | Strict F1-score | Relaxed precision | Relaxed recall | Relaxed F1-score |
|---|---|---|---|---|---|---|
| Chunks | 0.65 | 0.64 | 0.65 | 0.82 | 0.80 | 0.81 |
| Expressions | 0.50 | 0.56 | 0.53 | 0.69 | 0.78 | 0.73 |
| All | 0.57 | 0.57 | 0.57 | 0.71 | 0.71 | 0.71 |
The results in the All row are calculated as micro-averages
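The strict and relaxed figures above can be illustrated with a minimal sketch. It assumes that strict agreement means identical span boundaries and label, and that relaxed agreement means any overlap between spans with the same label; the exact matching criteria are not stated in this excerpt, so both definitions, the `Span` layout, and the function names are assumptions made for illustration. Pooling the spans of all annotation types before counting matches yields a micro-average.

```python
from typing import List, Tuple

# Hypothetical span layout: (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """Assumed relaxed criterion: same label and at least one shared token."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def matched(source: List[Span], target: List[Span], relaxed: bool) -> int:
    """Count spans in `source` that have a counterpart in `target`."""
    if relaxed:
        return sum(1 for s in source if any(overlaps(s, t) for t in target))
    target_set = set(target)  # assumed strict criterion: exact boundaries and label
    return sum(1 for s in source if s in target_set)


def agreement(reference: List[Span], candidate: List[Span], relaxed: bool = False):
    """Precision, recall and F1 of `candidate` measured against `reference`.

    Spans of all labels are pooled, so the result is a micro-average.
    """
    p = matched(candidate, reference, relaxed) / len(candidate) if candidate else 0.0
    r = matched(reference, candidate, relaxed) / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy data, not taken from the corpus
annotator_a = [(0, 2, "NP"), (2, 3, "MV"), (3, 6, "NP")]
annotator_b = [(0, 2, "NP"), (2, 3, "MV"), (4, 6, "NP"), (6, 7, "AP")]

strict = agreement(annotator_a, annotator_b)
relaxed = agreement(annotator_a, annotator_b, relaxed=True)
print(strict, relaxed)
```

In the toy data the third span differs by one token between the two annotators, so it counts as a disagreement under the strict criterion and as a match under the relaxed one.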
Pairwise IAA between all annotators
| | Chunks AB | Chunks CD | Chunks AC | Chunks BC | Chunks AD | Chunks BD | Expressions AB | Expressions CD | Expressions AC | Expressions BC | Expressions AD | Expressions BD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pr_S | 0.86 | 0.82 | 0.90 | 0.81 | 0.86 | 0.85 | 0.79 | 0.60 | 1.00 | 0.79 | 0.60 | 0.50 |
| Re_S | 0.84 | 0.75 | 0.91 | 0.84 | 0.78 | 0.78 | 0.73 | 0.90 | 1.00 | 0.73 | 0.90 | 0.70 |
| F1_S | 0.85 | 0.78 | 0.90 | 0.82 | 0.82 | 0.82 | 0.76 | 0.72 | 1.00 | 0.76 | 0.72 | 0.58 |
| Pr_R | 0.90 | 0.92 | 0.90 | 0.84 | 0.94 | 0.92 | 0.79 | 0.67 | 1.00 | 0.79 | 0.67 | 0.50 |
| Re_R | 0.88 | 0.84 | 0.92 | 0.87 | 0.85 | 0.84 | 0.73 | 1.00 | 1.00 | 0.73 | 1.00 | 0.70 |
| F1_R | 0.89 | 0.88 | 0.91 | 0.86 | 0.90 | 0.88 | 0.76 | 0.80 | 1.00 | 0.76 | 0.80 | 0.58 |
S and R subscripts stand for strict and relaxed agreement. Columns represent annotator pairs denoted with their letters
Harvey Corpus statistics: annotation counts, average tokens per annotation, and average annotations per record
| | NP | MV | AP | Chunks | TE | LE | QE | OE | NEs | All |
|---|---|---|---|---|---|---|---|---|---|---|
| Count | 6304 | 2613 | 893 | 9810 | 605 | 481 | 321 | 73 | 1480 | 11,290 |
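As a quick consistency check on the counts above, the following sketch recomputes the aggregate columns. The grouping of NP, MV and AP under Chunks and of TE, LE, QE and OE under NEs is inferred from the column order and from the arithmetic itself, so it should be read as an assumption, not as a statement of the annotation scheme.

```python
# Per-type annotation counts copied from the Count row
counts = {"NP": 6304, "MV": 2613, "AP": 893,
          "TE": 605, "LE": 481, "QE": 321, "OE": 73}

# Assumed grouping, inferred from the table layout
chunk_types = ("NP", "MV", "AP")
ne_types = ("TE", "LE", "QE", "OE")

chunks = sum(counts[t] for t in chunk_types)
nes = sum(counts[t] for t in ne_types)
total = chunks + nes

print(chunks, nes, total)  # 9810 1480 11290
```

The three sums agree with the Chunks, NEs and All columns of the table.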
Fig. 7 Arithmetic mean (white dot) and frequency distribution of (a) tokens per annotation and (b) annotations per record, across all annotation types
Fig. 8 Learning curves from 500-fold bootstrapping, generated using YamCha: (a) chunking and (b) named entity recognition. Training samples range from 25 to 675 records in steps of 25; test samples always contain 75 records
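The sampling schedule described in the Fig. 8 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it draws repeated random train/test splits without replacement, which may differ from the bootstrapping actually used, and `train_and_score` is a hypothetical callable standing in for training a chunker (for example with YamCha) and returning a score on the held-out records. The function and parameter names are likewise illustrative.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterable, List, Sequence


def learning_curve(
    records: Sequence,
    train_and_score: Callable[[List, List], float],  # hypothetical: trains, returns a score
    sizes: Iterable[int] = range(25, 700, 25),       # 25, 50, ..., 675 training records
    test_size: int = 75,
    folds: int = 500,
    seed: int = 0,
) -> Dict[int, float]:
    """Mean held-out score for each training-set size over `folds` random splits."""
    rng = random.Random(seed)
    curve: Dict[int, float] = {}
    for size in sizes:
        scores = []
        for _ in range(folds):
            # Shuffle a copy, then carve out disjoint test and training samples
            shuffled = rng.sample(list(records), len(records))
            test = shuffled[:test_size]
            train = shuffled[test_size:test_size + size]
            scores.append(train_and_score(train, test))
        curve[size] = mean(scores)
    return curve


# Toy usage: integer "records" and a dummy scorer that only reflects sample sizes
toy_records = list(range(750))
toy_curve = learning_curve(
    toy_records,
    lambda train, test: len(train) / (len(train) + len(test)),
    folds=2,
)
print(toy_curve[25], toy_curve[675])
```

With 750 records, the largest training sample (675) and the test sample (75) together use every record exactly once per split, which matches the range given in the caption.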