Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, Hua Xu.
Abstract
Existing general clinical natural language processing (NLP) systems such as MetaMap and Clinical Text Analysis and Knowledge Extraction System have been successfully applied to information extraction from clinical text. However, end users often have to customize existing systems for their individual tasks, which can require substantial NLP skills. Here we present CLAMP (Clinical Language Annotation, Modeling, and Processing), a newly developed clinical NLP toolkit that provides not only state-of-the-art NLP components, but also a user-friendly graphic user interface that can help users quickly build customized NLP pipelines for their individual applications. Our evaluation shows that the CLAMP default pipeline achieved good performance on named entity recognition and concept encoding. We also demonstrate the efficiency of the CLAMP graphic user interface in building customized, high-performance NLP pipelines with 2 use cases: extracting smoking status and lab test values. CLAMP is publicly available for research use, and we believe it is a unique asset for the clinical NLP community.
Keywords: clinical text processing; machine learning; natural language processing
Year: 2018 PMID: 29186491 PMCID: PMC7378877 DOI: 10.1093/jamia/ocx132
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. The user interface for building a pipeline in CLAMP.
Figure 2. The interface in CLAMP for annotating entities and relations.
Figure 3. The interface for selecting features and evaluation options for building machine learning–based NER models using CLAMP.
Performance of CLAMP-CMD on the NER task (problem, treatment, and test) across different corpora
| Corpus | No. of entities | State-of-the-art F1 (exact/relaxed)a | Precision (exact) | Recall (exact) | F1 (exact) | Precision (relaxed) | Recall (relaxed) | F1 (relaxed) |
|---|---|---|---|---|---|---|---|---|
| i2b2 | 72 846 | 0.85/0.92 | 0.89 | 0.86 | 0.88 | 0.96 | 0.93 | 0.94 |
| MTSamples | 25 531 | N/A | 0.84 | 0.81 | 0.83 | 0.92 | 0.89 | 0.91 |
| UTNotes | 124 869 | N/A | 0.92 | 0.90 | 0.91 | 0.96 | 0.94 | 0.95 |

a The best performance reported in the shared task.
Performance of CLAMP (version 1.3), MetaMap (2016), MetaMap Lite (2016, version 3.4), and cTAKES (version 4) on extracting disease concepts and encoding them to UMLS CUIs, using the SemEval-2014 corpus
| NLP System | Correct | Predicted | Gold | Precision | Recall | F1 | Processing time (s/doc)a |
|---|---|---|---|---|---|---|---|
| CLAMP | 7228 | 9329 | 13 555 | 0.775 | 0.533 | 0.632 | 0.95 |
| MetaMap | 5574 | 10 214 | 13 555 | 0.546 | 0.411 | 0.469 | 7.07 |
| MetaMap Lite | 8009 | 11 282 | 13 555 | 0.710 | 0.591 | 0.645 | 1.95 |
| cTAKES | 9126 | 19 713 | 13 555 | 0.463 | 0.673 | 0.549 | 2.27 |
a All evaluations were performed on a MacBook with 16 GB RAM and a 4-core Intel i7 CPU. MetaMap was run with its default settings; cTAKES used the fast dictionary lookup annotator. The performance of both MetaMap and cTAKES could likely be improved further by optimizing their settings.
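The precision, recall, and F1 columns in the table above follow directly from the correct/predicted/gold entity counts: precision is correct/predicted, recall is correct/gold, and F1 is their harmonic mean. A minimal sketch re-deriving the CLAMP row (the helper name `prf1` is our own, not part of CLAMP):

```python
def prf1(correct: int, predicted: int, gold: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for an entity extraction run.

    precision = correct / predicted
    recall    = correct / gold
    F1        = harmonic mean of precision and recall
    """
    precision = correct / predicted
    recall = correct / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# CLAMP row of the table: 7228 correct, 9329 predicted, 13 555 gold entities
p, r, f = prf1(7228, 9329, 13555)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # → P=0.775 R=0.533 F1=0.632
```

The same function reproduces the other rows, e.g. cTAKES's high recall (9126/13 555 ≈ 0.673) paired with low precision (9126/19 713 ≈ 0.463).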