Martijn G Kersloot1, Francis Lau2, Ameen Abu-Hanna3, Derk L Arts3, Ronald Cornet3.
Abstract
BACKGROUND: Information in Electronic Health Records is largely stored as unstructured free text. Natural language processing (NLP), or Medical Language Processing (MLP) in medicine, aims at extracting structured information from free text, and is less expensive and time-consuming than manual extraction. However, most algorithms in MLP are institution-specific or address only one clinical need, and thus cannot be broadly applied. In addition, most MLP systems do not detect concepts in misspelled text and cannot detect attribute relationships between concepts. The objective of this study was to develop and evaluate an MLP application that includes generic algorithms for the detection of (misspelled) concepts and of attribute relationships between them.
Keywords: Algorithms; Chart abstraction; Electronic health records; Natural language processing; SNOMED CT
Year: 2019 PMID: 31533810 PMCID: PMC6749652 DOI: 10.1186/s13326-019-0207-3
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 A representation of an unqualified relationship between Non-small cell lung cancer and Recurrent (1) and an attribute relationship of Non-small cell lung cancer with Recurrent as Clinical course (2). The attribute relationship is modelled as a SNOMED CT concept definition diagram of recurrent non-small cell lung cancer (a new post-coordinated expression). Purple blocks represent defined concepts, blue blocks represent primitive SNOMED CT concepts and yellow blocks represent attributes. Attribute groups are represented using a white circle, and conjunctions are represented using a black dot
Fig. 2 A visual representation of the cTAKES AggregatePlaintextFastUMLSProcessor pipeline
Fig. 3 A visual representation of DIRECT in relation to cTAKES. API: Application programming interface. Blue blocks represent developed components, rounded blocks MLP algorithms
Fig. 4 Transformation of the syntactic relationships in the sentence ‘This patient is diagnosed with recurrent nonsmall cell lung cancer’ (1) to a SNOMED CT concept (2) and a relationship between detected SNOMED CT concepts (3, Recurrent and Non-small cell lung cancer)
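The attribute relationship described in Fig. 1 and Fig. 4 can be written as a post-coordinated expression in the `focus: attribute = value` form that the table footnotes also use. The following is a minimal sketch of rendering such an expression; the `Concept` class and the `post_coordinate` function are hypothetical names introduced for illustration and are not part of DIRECT or cTAKES.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A SNOMED CT concept: identifier plus fully specified name."""
    sctid: str
    term: str

    def render(self) -> str:
        # Terms are delimited by pipes, as in the footnotes of the tables.
        return f"{self.sctid} | {self.term} |"


def post_coordinate(focus: Concept, attribute: Concept, value: Concept) -> str:
    """Render a single-attribute refinement: focus: attribute = value."""
    return f"{focus.render()}: {attribute.render()} = {value.render()}"


# Identifiers and names taken from the figure captions and table footnotes.
nsclc = Concept("254637007", "Non-small cell lung cancer (disorder)")
clinical_course = Concept("263502005", "Clinical course (attribute)")
recurrent = Concept("255227004", "Recurrent (qualifier value)")

expression = post_coordinate(nsclc, clinical_course, recurrent)
print(expression)
```

The sketch only formats a string; it does not validate the expression against the SNOMED CT concept model.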
Variables used in the F-score equation
| Algorithm | True positive (TP) | False positive (FP) | False negative (FN) |
|---|---|---|---|
| Named-entity recognition<sup>a</sup> | Same medical concept identified as in the gold standard. | Identified medical concept differs from the gold standard. | Medical concept mentioned, but not identified. |
| Attribute relationship detection<sup>b</sup> | Attribute relationship present and detected. | Attribute relationship not present, but detected. | Attribute relationship present, but not detected. |
<sup>a</sup> SNOMED CT concepts 93880001 | Primary malignant neoplasm of lung (disorder) |, 254637007 | Non-small cell lung cancer (disorder) |, and 255227004 | Recurrent (qualifier value) |
<sup>b</sup> SNOMED CT relationship 93880001 | Primary malignant neoplasm of lung (disorder) |: 263502005 | Clinical course (attribute) | = 255227004 | Recurrent (qualifier value) |
<sup>c</sup> SNOMED CT concept 255227004 | Recurrent (qualifier value) |
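The TP, FP and FN counts defined above feed into precision, recall and the F-score. The record does not reproduce the equation itself, so the sketch below assumes the usual balanced F-score (the harmonic mean of precision and recall); the function name `precision_recall_f` is illustrative. The example counts are the DIRECT test-set counts from the evaluation-outcomes table (relaxed approach).

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from raw counts.

    Assumption: balanced F-score, F = 2 * P * R / (P + R).
    Ratios with an empty denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Non-small cell lung cancer, DIRECT, test set, relaxed approach:
# 28 detected and present, 1 detected but absent, 8 present but not detected.
nsclc = tuple(round(x, 3) for x in precision_recall_f(tp=28, fp=1, fn=8))
print(nsclc)  # (0.966, 0.778, 0.862)

# Recurrent lung cancer relationship, DIRECT, test set.
relation = tuple(round(x, 3) for x in precision_recall_f(tp=9, fp=0, fn=3))
print(relation)  # (1.0, 0.75, 0.857)
```

Both results agree with the precision, recall and F-score values reported for DIRECT on the test set.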
Specification of the included charts
| Set | Outcome | Lung cancer | | Non-small cell lung cancer | | Recurrence | Relation |
|---|---|---|---|---|---|---|---|
| | | Implied | Strict | Implied | Strict | Implied | Strict |
| Development set | Positive | 27 | 23 | 17 | 17 | 10 | 6 |
| Development set | Negative | – | – | – | – | 5 | – |
| Development set | Not listed | 3 | 7 | 13 | 13 | 15 | 13 |
| Test set | Positive | 51 | 40 | 36 | 31 | 20 | 10 |
| Test set | Negative | – | – | – | – | 10 | – |
| Test set | Not listed | 17 | 28 | 32 | 37 | 38 | 58 |
Relation: Relationship between Lung cancer and Recurrence
Fig. 5 Screenshots of DIRECT. 1. Input free text using a text field or text file(s). 2. Selection of SNOMED CT concepts to focus on and top-level concepts to include. 3. Selection of SNOMED CT attribute relationships to focus on. 4. Processing of the free text and results
Precision, recall and calculated F-scores from the evaluation outcomes
| Implementation | Algorithm | Concept | Approach | Development set: Precision | Development set: Recall | Development set: F-score | Test set: Precision | Test set: Recall | Test set: F-score |
|---|---|---|---|---|---|---|---|---|---|
| UMLS2011 (cTAKES with UMLS 2011) | Named-entity recognition | Lung cancer<sup>a</sup> | Implied | 0.938 | 0.556 | 0.698 | 1.000 | 0.686 | 0.814 |
| UMLS2011 | Named-entity recognition | Lung cancer<sup>a</sup> | Strict | 0.938 | 0.652 | 0.769 | 0.943 | 0.825 | 0.880 |
| UMLS2011 | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Implied | 0.917 | 0.647 | 0.759 | 0.967 | 0.806 | 0.879 |
| UMLS2011 | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Strict | 0.917 | 0.647 | 0.759 | 0.967 | 0.935 | 0.951 |
| UMLS2011 | Named-entity recognition | Recurrence<sup>c</sup> | | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| UMLS2016 (cTAKES with UMLS 2016) | Named-entity recognition | Lung cancer<sup>a</sup> | Implied | 1.000 | 0.370 | 0.541 | 1.000 | 0.569 | 0.725 |
| UMLS2016 | Named-entity recognition | Lung cancer<sup>a</sup> | Strict | 1.000 | 0.435 | 0.606 | 1.000 | 0.725 | 0.841 |
| UMLS2016 | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Implied | 1.000 | 0.294 | 0.455 | 0.947 | 0.500 | 0.655 |
| UMLS2016 | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Strict | 1.000 | 0.294 | 0.455 | 0.947 | 0.581 | 0.720 |
| UMLS2016 | Named-entity recognition | Recurrence<sup>c</sup> | | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| UMLS2016 + Dict. (cTAKES with UMLS 2016 and custom dictionary) | Named-entity recognition | Lung cancer<sup>a</sup> | Implied | 1.000 | 0.852 | 0.920 | 1.000 | 0.706 | 0.828 |
| UMLS2016 + Dict. | Named-entity recognition | Lung cancer<sup>a</sup> | Strict | 1.000 | 1.000 | 1.000 | 1.000 | 0.900 | 0.947 |
| UMLS2016 + Dict. | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Implied | 1.000 | 0.765 | 0.867 | 0.957 | 0.611 | 0.746 |
| UMLS2016 + Dict. | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Strict | 1.000 | 0.765 | 0.867 | 0.957 | 0.710 | 0.815 |
| UMLS2016 + Dict. | Named-entity recognition | Recurrence<sup>c</sup> | | 1.000 | 1.000 | 1.000 | 0.882 | 1.000 | 0.938 |
| DIRECT (cTAKES with UMLS 2016, custom dictionary, and additional processing) | Named-entity recognition | Lung cancer<sup>a</sup> | Implied | 1.000 | 0.852 | 0.920 | 1.000 | 0.706 | 0.828 |
| DIRECT | Named-entity recognition | Lung cancer<sup>a</sup> | Strict | 1.000 | 1.000 | 1.000 | 1.000 | 0.900 | 0.947 |
| DIRECT | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Implied | 1.000 | 1.000 | 1.000 | 0.966 | 0.778 | 0.862 |
| DIRECT | Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Strict | 1.000 | 1.000 | 1.000 | 0.966 | 0.903 | 0.933 |
| DIRECT | Named-entity recognition | Recurrence<sup>c</sup> | | 1.000 | 1.000 | 1.000 | 0.879 | 0.967 | 0.921 |
| DIRECT | Attribute relationship detection | Recurrent lung cancer<sup>d</sup> | | 1.000 | 1.000 | 1.000 | 1.000 | 0.750 | 0.857 |
<sup>a</sup> SNOMED CT concept 93880001 | Primary malignant neoplasm of lung (disorder) |
<sup>b</sup> SNOMED CT concept 254637007 | Non-small cell lung cancer (disorder) |
<sup>c</sup> SNOMED CT concept 255227004 | Recurrent (qualifier value) |
<sup>d</sup> Relationship between three SNOMED CT concepts: 93880001 | Primary malignant neoplasm of lung (disorder) |: 263502005 | Clinical course (attribute) | = 255227004 | Recurrent (qualifier value) |
F-scores calculated from the evaluation outcomes
| Algorithm | Concept | Approach | Development set: UMLS2011 | Development set: UMLS2016 | Development set: UMLS2016D | Development set: DIRECT | Test set: UMLS2011 | Test set: UMLS2016 | Test set: UMLS2016D | Test set: DIRECT |
|---|---|---|---|---|---|---|---|---|---|---|
| Named-entity recognition | Lung cancer<sup>a</sup> | Implied | 0.698 | 0.541 | 0.920 | 0.920 | 0.814 | 0.725 | 0.828 | 0.828 |
| Named-entity recognition | Lung cancer<sup>a</sup> | Strict | 0.769 | 0.606 | 1.000 | 1.000 | 0.880 | 0.841 | 0.947 | 0.947 |
| Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Implied | 0.759 | 0.455 | 0.867 | 1.000 | 0.879 | 0.655 | 0.746 | 0.862 |
| Named-entity recognition | Non-small cell lung cancer<sup>b</sup> | Strict | 0.759 | 0.455 | 0.867 | 1.000 | 0.951 | 0.720 | 0.815 | 0.933 |
| Named-entity recognition | Recurrence<sup>c</sup> | | 0.000 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.938 | 0.921 |
| Relationship detection | Recurrent lung cancer<sup>d</sup> | | – | – | – | 1.000 | – | – | – | 0.857 |
<sup>a</sup> SNOMED CT concept 93880001 | Primary malignant neoplasm of lung (disorder) |
<sup>b</sup> SNOMED CT concept 254637007 | Non-small cell lung cancer (disorder) |
<sup>c</sup> SNOMED CT concept 255227004 | Recurrent (qualifier value) |
<sup>d</sup> Relationship between three SNOMED CT concepts: 93880001 | Primary malignant neoplasm of lung (disorder) |: 263502005 | Clinical course (attribute) | = 255227004 | Recurrent (qualifier value) |
Outcomes of the two-tailed permutation test between the different implementations. Statistically significant values (p < 0.05) are in bold face
| Implementation 1 | Implementation 2 | Lung cancer<sup>a</sup> | Non-small cell lung cancer<sup>b</sup> | Recurrence<sup>c</sup> |
|---|---|---|---|---|
| UMLS2011 | UMLS2016 | | | 1.000 |
| UMLS2011 | UMLS2016 + Dict. | 1.000 | | |
| UMLS2011 | DIRECT | 1.000 | 1.000 | |
| UMLS2016 | UMLS2016 + Dict. | | 0.142 | |
| UMLS2016 | DIRECT | | | |
| UMLS2016 + Dict. | DIRECT | 1.000 | | 1.000 |
<sup>a</sup> SNOMED CT concept 93880001 | Primary malignant neoplasm of lung (disorder) |
<sup>b</sup> SNOMED CT concept 254637007 | Non-small cell lung cancer (disorder) |
<sup>c</sup> SNOMED CT concept 255227004 | Recurrent (qualifier value) |
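The record does not state which statistic was permuted or at what unit. As an illustration only, the sketch below shows one common form of a two-tailed paired permutation (approximate randomization) test: it assumes each chart carries an outcome label (`"TP"`, `"FP"`, `"FN"` or `"TN"`) per implementation and compares F-scores. The function names, the label encoding, the number of rounds and the example data are all assumptions, not the authors' procedure.

```python
import random


def f_score(outcomes: list[str]) -> float:
    """Balanced F-score from per-chart outcome labels."""
    tp = outcomes.count("TP")
    fp = outcomes.count("FP")
    fn = outcomes.count("FN")
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def permutation_test(system_a: list[str], system_b: list[str],
                     rounds: int = 10_000, seed: int = 0) -> float:
    """Two-tailed p-value for the difference in F-score between two systems.

    In every round the two systems' outcomes are swapped per chart with
    probability 0.5; the p-value is the share of rounds whose absolute
    F-score difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(f_score(system_a) - f_score(system_b))
    at_least_as_extreme = 0
    for _ in range(rounds):
        perm_a, perm_b = [], []
        for a, b in zip(system_a, system_b):
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(f_score(perm_a) - f_score(perm_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (rounds + 1)


# Made-up outcome labels for six charts, for illustration only.
baseline = ["TP", "FN", "FN", "TP", "FN", "TN"]
improved = ["TP", "TP", "TP", "TP", "FN", "TN"]

p_identical = permutation_test(improved, improved, rounds=1_000)
p_different = permutation_test(baseline, improved, rounds=1_000)
print(p_identical)  # 1.0: identical outputs can never differ after swapping
```

Two implementations with identical per-chart outcomes always yield a p-value of 1.000 under this scheme, which is consistent with the 1.000 entries in the table for pairs whose reported scores are equal.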
Contents of the custom dictionary
| CUI | Concept description | Type | Type description | Words |
|---|---|---|---|---|
| C2945760 | Recurrent | T079 | Temporal Concept | recurrent |
| C2945760 | Recurrent | T079 | Temporal Concept | recurring |
| C2945760 | Recurrent | T079 | Temporal Concept | recurrence |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non-small cell lung carcinoma |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | nonsmall cell lung carcinoma |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non-small cell lung adenocarcinoma |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | nonsmall cell lung adenocarcinoma |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non-small cell carcinoma of the lung |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | nonsmall cell carcinoma of the lung |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non-small cell adenocarcinoma of the lung |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | nonsmall cell adenocarcinoma of the lung |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non small cell lung carcinoma |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non small cell lung adenocarcinoma |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non small cell carcinoma of the lung |
| C0007131 | Non-Small Cell Lung Carcinoma | T191 | Neoplastic Process | non small cell adenocarcinoma of the lung |
| C0149925 | Small cell carcinoma of lung | T191 | Neoplastic Process | small cell lung carcinoma |
| C0149925 | Small cell carcinoma of lung | T191 | Neoplastic Process | small cell lung adenocarcinoma |
| C0149925 | Small cell carcinoma of lung | T191 | Neoplastic Process | small cell carcinoma of the lung |
| C0149925 | Small cell carcinoma of lung | T191 | Neoplastic Process | small cell adenocarcinoma of the lung |
| C1306460 | Primary malignant neoplasm of lung | T191 | Neoplastic Process | lung carcinoma |
| C1306460 | Primary malignant neoplasm of lung | T191 | Neoplastic Process | lung adenocarcinoma |
| C1306460 | Primary malignant neoplasm of lung | T191 | Neoplastic Process | carcinoma of the lung |
| C1306460 | Primary malignant neoplasm of lung | T191 | Neoplastic Process | adenocarcinoma of the lung |
CUI: Unified Medical Language System's Concept Unique Identifier
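The non-small cell lung carcinoma entries above follow a regular pattern: three spellings of the prefix combined with four head phrases. The sketch below regenerates those twelve rows from that pattern. It is only an illustration of how such a variant list could be produced, not a description of how the authors built the dictionary; the constant and function names are hypothetical.

```python
from itertools import product

# CUI and semantic type (TUI) as listed in the custom dictionary table.
NSCLC_CUI = "C0007131"
NEOPLASTIC_PROCESS_TUI = "T191"

# Spelling variants of the prefix and the four head phrases seen in the table.
PREFIX_SPELLINGS = ["non-small", "nonsmall", "non small"]
HEAD_PHRASES = [
    "lung carcinoma",
    "lung adenocarcinoma",
    "carcinoma of the lung",
    "adenocarcinoma of the lung",
]


def nsclc_dictionary_rows() -> list[tuple[str, str, str]]:
    """Return (CUI, TUI, words) rows for every prefix/head combination."""
    return [
        (NSCLC_CUI, NEOPLASTIC_PROCESS_TUI, f"{prefix} cell {head}")
        for prefix, head in product(PREFIX_SPELLINGS, HEAD_PHRASES)
    ]


rows = nsclc_dictionary_rows()
for cui, tui, words in rows:
    print(f"{cui} | {tui} | {words}")
```

Enumerating variants this way keeps the list complete when a new spelling or head phrase is added, at the cost of possibly generating phrases that never occur in practice.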
Evaluation outcomes of the different implementations and datasets
| Implementation | Set | Approach | Outcome | Named-entity recognition: Lung cancer<sup>a</sup>, present | Named-entity recognition: Lung cancer<sup>a</sup>, absent | Named-entity recognition: Non-small cell lung cancer<sup>b</sup>, present | Named-entity recognition: Non-small cell lung cancer<sup>b</sup>, absent | Named-entity recognition: Recurrence<sup>c</sup>, present | Attribute relation detection: Recurrent lung cancer<sup>d</sup>, present | Attribute relation detection: Recurrent lung cancer<sup>d</sup>, absent |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 (cTAKES with UMLS 2011) | Development set | Relaxed | Detected | 15 | 1 | 11 | 1 | 0 | | |
| 1 | Development set | Relaxed | Not detected | 12 | 2 | 6 | 12 | 15 | | |
| 1 | Development set | Strict | Detected | 15 | 1 | 11 | 1 | | | |
| 1 | Development set | Strict | Not detected | 8 | 6 | 6 | 12 | | | |
| 1 | Test set | Relaxed | Detected | 35 | 0 | 29 | 1 | 0 | | |
| 1 | Test set | Relaxed | Not detected | 16 | 17 | 7 | 31 | 30 | | |
| 1 | Test set | Strict | Detected | 33 | 2 | 29 | 1 | | | |
| 1 | Test set | Strict | Not detected | 7 | 26 | 2 | 36 | | | |
| 2 (cTAKES with UMLS 2016) | Development set | Relaxed | Detected | 10 | 0 | 5 | 0 | 0 | | |
| 2 | Development set | Relaxed | Not detected | 17 | 3 | 12 | 13 | 15 | | |
| 2 | Development set | Strict | Detected | 10 | 0 | 5 | 0 | | | |
| 2 | Development set | Strict | Not detected | 13 | 7 | 12 | 13 | | | |
| 2 | Test set | Relaxed | Detected | 29 | 0 | 18 | 1 | 0 | | |
| 2 | Test set | Relaxed | Not detected | 22 | 17 | 18 | 31 | 30 | | |
| 2 | Test set | Strict | Detected | 29 | 0 | 18 | 1 | | | |
| 2 | Test set | Strict | Not detected | 11 | 28 | 13 | 36 | | | |
| 3 (cTAKES with UMLS 2016 and custom dictionary) | Development set | Relaxed | Detected | 23 | 0 | 13 | 0 | 15 | | |
| 3 | Development set | Relaxed | Not detected | 4 | 3 | 4 | 13 | 0 | | |
| 3 | Development set | Strict | Detected | 23 | 0 | 13 | 0 | | | |
| 3 | Development set | Strict | Not detected | 0 | 7 | 4 | 13 | | | |
| 3 | Test set | Relaxed | Detected | 36 | 0 | 22 | 1 | 30 | | |
| 3 | Test set | Relaxed | Not detected | 15 | 17 | 14 | 31 | 0 | | |
| 3 | Test set | Strict | Detected | 36 | 0 | 22 | 1 | | | |
| 3 | Test set | Strict | Not detected | 4 | 28 | 9 | 36 | | | |
| 4 (DIRECT: cTAKES with UMLS 2016, custom dictionary, and post-processing) | Development set | Relaxed | Detected | 23 | 0 | 17 | 0 | 15 | 6 | 0 |
| 4 | Development set | Relaxed | Not detected | 4 | 3 | 0 | 13 | 0 | 0 | 10 |
| 4 | Development set | Strict | Detected | 23 | 0 | 17 | 0 | | | |
| 4 | Development set | Strict | Not detected | 0 | 7 | 0 | 13 | | | |
| 4 | Test set | Relaxed | Detected | 36 | 0 | 28 | 1 | 29 | 9 | 0 |
| 4 | Test set | Relaxed | Not detected | 15 | 17 | 8 | 31 | 1 | 3 | 18 |
| 4 | Test set | Strict | Detected | 36 | 0 | 28 | 1 | | | |
| 4 | Test set | Strict | Not detected | 4 | 28 | 3 | 36 | | | |
<sup>a</sup> SNOMED CT concept 93880001 | Primary malignant neoplasm of lung (disorder) |
<sup>b</sup> SNOMED CT concept 254637007 | Non-small cell lung cancer (disorder) |
<sup>c</sup> SNOMED CT concept 255227004 | Recurrent (qualifier value) |
<sup>d</sup> Relationship between three SNOMED CT concepts: 93880001 | Primary malignant neoplasm of lung (disorder) |: 263502005 | Clinical course (attribute) | = 255227004 | Recurrent (qualifier value) |