Phillip Richter-Pechanski, Nicolas A Geis, Christina Kiriakou, Dominic M Schwab, Christoph Dieterich.
Abstract
OBJECTIVE: A vast amount of medical data is still stored in unstructured text documents. We present an automated method for extracting information from unstructured German clinical routine data from the cardiology domain, enabling its use in state-of-the-art data-driven deep learning projects.
Keywords: Deep learning; bidirectional encoder representations from transformer; fine-tuning; medical information extraction; natural language processing; pre-trained language models
Year: 2021 PMID: 34868618 PMCID: PMC8637713 DOI: 10.1177/20552076211057662
Source DB: PubMed Journal: Digit Health ISSN: 2055-2076
Figure 1. Graphical abstract: automatic extraction of 12 cardiovascular concepts (CCs) from German discharge letters using pre-trained language models.
CCs – data analysis.
| CC | ICD-10 | Description | Instances | Uniqueness (%) |
|---|---|---|---|---|
| AP | I20 | Describes a chest pain or pressure. | 211 | 54 |
| Dyspnoe | R06.0 | Dyspnoe describes a feeling of not being able to breathe sufficiently. | 215 | 22 |
| Nykturie | R35 | Nocturia describes the need of a patient to wake up in the night to urinate. | 72 | 4 |
| Ödeme | R60 | Edema is the swelling of body tissue due to fluid retention. | 127 | 28 |
| Palpitationen | R00.2 | Palpitation describes the conscious awareness of your own heartbeat. | 136 | 17 |
| Schwindel | H81-82 | Vertigo describes the feeling of turning or swaying. | 149 | 10 |
| Synkope | R55 | Syncope describes the sudden loss of consciousness. | 168 | 8 |
| Arterielle Hypertonie | I10.* | Hypertension is a condition in which the blood pressure in the arteries is persistently elevated. | 175 | 5 |
| Hypercholesterinämie | E78.* | This describes all appearances of cholesterols or lipids, mostly expressed as cardiovascular risk factors. | 128 | 9 |
| DM | E10-14 | DM is a metabolic disorder characterized by high blood sugar levels. | 65 | 8 |
| FA | – | FA is the part of the anamnesis that gives information about specific diseases of family members. | 74 | 11 |
| Nikotinkonsum | F17.* | Describes a state of dependence on nicotine. | 111 | 11 |
Description: Distribution of CCs in CardioAnno corpus (first column) including ICD-10 code (second column), short description (third column), number of instances (fourth column) and proportion of unique instances (fifth column).
CC: cardiovascular concept; ICD-10: International Classification of Diseases, Tenth Edition; AP: Angina Pectoris; DM: diabetes mellitus; FA: familial anamnesis.
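The uniqueness column can be read as the share of distinct surface forms among all annotated instances of a concept. The following is a minimal sketch under that assumption; the source does not state the exact formula, and the function name `uniqueness_percent`, the case normalisation and the example mentions are all hypothetical.

```python
def uniqueness_percent(mentions):
    """Share of distinct surface forms among all mentions, in percent."""
    if not mentions:
        return 0.0
    # Assumption of this sketch: case and surrounding whitespace are
    # ignored when deciding whether two mentions are the same form.
    distinct = {m.strip().lower() for m in mentions}
    return 100.0 * len(distinct) / len(mentions)


# Hypothetical mentions for the concept Synkope, not taken from CardioAnno.
synkope_mentions = ["Synkope", "Synkope", "synkope", "Bewusstseinsverlust"]
print(uniqueness_percent(synkope_mentions))
```

Under this reading, a low value such as the 4% reported for Nykturie would mean that nearly all instances share a few identical wordings.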
Figure 2. Discharge letter snippet annotated with CCs. For example, the sequence ‘starke Druckschmerzen auf der Brust’ (‘severe pressure pain in the chest’) is annotated with the concept AP.
Figure 3. Fine-tuning BERT for cardiovascular concept extraction: the input sequence ‘pectangiöse Beschwerden…’ is tokenized and embedded into a numerical representation. Each output representation T is used as input to a feed-forward neural network (FFNN) with a final softmax layer. For example, the token pectangiöse is labelled B-AP and the token Beschwerden is labelled I-AP.
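The final classification step in Figure 3 can be illustrated with a small sketch: per-token scores are passed through a softmax and the most probable label is chosen. This is not the authors' implementation. The reduced label set, the function names and the scores are hypothetical; in the real model the scores would come from the FFNN applied to each BERT output representation T.

```python
import math

# Reduced, hypothetical label set; the full task covers 12 concepts.
LABELS = ["O", "B-AP", "I-AP"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_tokens(tokens, scores_per_token):
    """Assign each token the label with the highest softmax probability."""
    labelled = []
    for token, scores in zip(tokens, scores_per_token):
        probs = softmax(scores)
        best = max(range(len(LABELS)), key=probs.__getitem__)
        labelled.append((token, LABELS[best]))
    return labelled


tokens = ["pectangiöse", "Beschwerden"]
# Hypothetical per-token scores, one value per label in LABELS.
scores_per_token = [[0.1, 2.3, 0.4], [0.2, 0.5, 1.9]]
print(label_tokens(tokens, scores_per_token))
```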
CCE – F1-score.
| CC | CRF | LSTM | BERTbase | BERTfine | BERTscratch |
|---|---|---|---|---|---|
| AP | 69 | 73 |
| 82 | 78 |
| Dyspnoe | 70 | 72 |
| 73 | 70 |
| Nykturie | 96 | 92 |
| 91 |
|
| Ödeme | 57 | 79 | 91 |
| 84 |
| Palpitation | 79 | 74 |
| 79 | 77 |
| Schwindel | 87 | 87 | 95 |
| 92 |
| Synkope | 87 | 85 | 88 |
| 88 |
| Hypertonie | 89 | 90 |
| 87 | 92 |
| Cholesterin | 86 | 89 |
| 90 | 89 |
| DM | 86 | 90 | 90 |
|
|
| FA | 81 | 77 |
| 74 | 80 |
| Nikotin | 86 | 87 | 92 | 90 |
|
| Micro average/standard deviation | 78/0.83 | 80/1.87 | 83/1.98 |
Note: Highest values are highlighted in bold type.
Description: Mean F1-score per concept and micro-average F1-score, including standard deviation, for the baseline classifiers (CRF and LSTM) and the three pre-trained language models (BERTbase, BERTfine and BERTscratch), in percent. The mean F1-score is calculated by summing the F1-scores of the individual folds and dividing the sum by four.
CC: cardiovascular concept; CCE: CC extraction; CRF: conditional random field; LSTM: long short-term memory; AP: Angina Pectoris; DM: Diabetes mellitus; FA: familial anamnesis.
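The averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code. The counts are hypothetical, the use of the population standard deviation is an assumption (the source does not say which variant was used), and the micro average is computed in the usual way by pooling true positives, false positives and false negatives across concepts.

```python
from statistics import mean, pstdev


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1_over_folds(fold_counts):
    """Mean and standard deviation of F1 over the folds of one concept."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in fold_counts]
    # Assumption: population standard deviation across folds.
    return mean(scores), pstdev(scores)


def micro_f1(counts_per_concept):
    """Micro-average F1: pool the counts of all concepts, then compute F1."""
    tp = sum(c[0] for c in counts_per_concept)
    fp = sum(c[1] for c in counts_per_concept)
    fn = sum(c[2] for c in counts_per_concept)
    return f1_score(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts for one concept in four folds.
folds = [(40, 8, 10), (42, 6, 9), (38, 10, 12), (41, 7, 8)]
print(mean_f1_over_folds(folds))
```

Because the micro average pools counts, frequent concepts such as AP and Dyspnoe weigh more heavily in it than rare ones such as DM.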
Figure 4. Precision/recall balance of CRF and BERTfine: balance between precision and recall per cardiovascular concept for the CRF and BERTfine models. Each data point in the scatter plots represents a CC. With the regression line defined as y = b + ax, an optimal result would be r² = 1, a slope coefficient of a = 1 and a bias of b = 0.
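The regression line from Figure 4 can be reproduced with an ordinary least-squares fit. The sketch below is a generic implementation, not the authors' code. It assumes precision on the x-axis and recall on the y-axis, which the caption does not specify, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b + a*x; returns (a, b, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take at least two distinct values")
    a = sxy / sxx                   # slope coefficient
    b = mean_y - a * mean_x         # bias (intercept)
    r2 = (sxy * sxy) / (sxx * syy)  # coefficient of determination
    return a, b, r2


# Hypothetical per-concept (precision, recall) pairs for one model.
precision = [0.70, 0.80, 0.90, 0.85]
recall = [0.66, 0.81, 0.88, 0.80]
a, b, r2 = fit_line(precision, recall)
print(f"a={a:.2f}, b={b:.2f}, r2={r2:.2f}")
```

A model whose precision equals its recall for every concept yields a = 1, b = 0 and r² = 1, the optimum named in the caption.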