| Literature DB >> 35104909 |
Areej Jaber1,2, Paloma Martínez2.
Abstract
BACKGROUND: Abbreviations are considered an essential part of the clinical narrative; they are used not only to save time and space but also to conceal serious or incurable illnesses. Misinterpreting a clinical abbreviation can affect the patients themselves as well as downstream services such as clinical decision support systems. There is no consensus in the scientific community on how new abbreviations are coined, which makes them difficult to understand. Disambiguating a clinical abbreviation means predicting its exact meaning from context, a crucial step in understanding clinical notes.
Entities:
MeSH:
Year: 2022 PMID: 35104909 PMCID: PMC9246508 DOI: 10.1055/s-0042-1742388
Source DB: PubMed Journal: Methods Inf Med ISSN: 0026-1270 Impact factor: 1.800
Some examples of how abbreviations are formed
| Rule | Abbreviation | Sense |
|---|---|---|
| Truncating the end of long form | DIP | |
| First-letter initialization of each word | VBG | |
| Syllabic initialization | US | |
| Combination of the beginnings of some of the words of the long form | Ad lib | |
| Symbols/synonyms substitution or initialization | T3 | Triiodothyronine |
Sample of University of Minnesota (UMN) data set abbreviations with their senses, sense counts, and distributions
| Abbreviation | Sentences | Tokens | Sense | No. | % |
|---|---|---|---|---|---|
| AMA | 2,881 | 37,887 | Against medical advice | 444 | 88.8 |
| | | | Advanced maternal age | 31 | 6.2 |
| | | | Antimitochondrial antibody | 25 | 5.0 |
| BAL | 3,267 | 38,483 | Bronchoalveolar lavage | 457 | 91.4 |
| | | | Blood alcohol level | 43 | 8.6 |
| OTC | 6,173 | 37,356 | Over the counter | 469 | 93.8 |
| | | | Ornithine transcarbamoylase | 31 | 6.2 |
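To make the disambiguation task concrete, here is a toy illustration (not the paper's method): choosing an abbreviation's sense by overlap between the surrounding context and per-sense cue words. The cue lists below are invented for illustration only; the paper instead fine-tunes pretrained BERT models.

```python
# Toy sense inventory for AMA (senses from the UMN table above);
# the cue words are hypothetical, chosen only for this sketch.
SENSE_CUES = {
    "AMA": {
        "Against medical advice": {"left", "discharged", "refused"},
        "Advanced maternal age": {"pregnancy", "gestation", "maternal"},
        "Antimitochondrial antibody": {"titer", "serology", "cirrhosis"},
    }
}

def disambiguate(abbrev, context):
    """Pick the sense whose cue set overlaps the context the most."""
    words = set(context.lower().split())
    cues = SENSE_CUES[abbrev]
    return max(cues, key=lambda sense: len(cues[sense] & words))

print(disambiguate("AMA", "patient left the hospital and refused treatment"))
# → Against medical advice
```

A real system must cope with contexts that contain no obvious cue words, which is why contextual representations such as BERT outperform keyword heuristics.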
Fig. 1 Pie chart of the relation between senses and number of examples, showing the frequency and percentage of each sense in the University of Minnesota (UMN) data set.
Architecture of the pretrained models used in this study
| Characteristic | No. |
|---|---|
| Layers | 12 |
| Hidden units | 768 |
| Self-attention heads | 12 |
| Total trainable parameters | 110M |
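The table matches the standard BERT-base configuration. As a sanity check, the parameter count can be estimated from the sizes above plus the usual BERT-base vocabulary (30,522), intermediate size (3,072), and position limit (512), which are assumptions here, not values stated in the text:

```python
# Rough trainable-parameter estimate for a BERT-base-style encoder.
# hidden/layers values come from the table; vocab, intermediate, and
# max_pos are standard BERT-base assumptions.
def bert_base_params(vocab=30522, hidden=768, layers=12,
                     intermediate=3072, max_pos=512, segments=2):
    emb = (vocab + max_pos + segments) * hidden + 2 * hidden   # embeddings + LayerNorm
    attn = 4 * (hidden * hidden + hidden)                      # Q, K, V, output projections
    ffn = (hidden * intermediate + intermediate
           + intermediate * hidden + hidden)                   # two dense layers
    norms = 2 * 2 * hidden                                     # two LayerNorms per layer
    pooler = hidden * hidden + hidden
    return emb + layers * (attn + ffn + norms) + pooler

print(round(bert_base_params() / 1e6), "M parameters")
```

The estimate lands near the ~110M figure in the table (the exact published count also includes head parameters not modeled here).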
Fig. 2 An example of input representation for one sequence including [CLS], [SEP], and [PAD] tokens, in addition to the added segment, attention mask, and embedding layers.
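The input representation in Fig. 2 can be sketched as follows. Special-token ids follow the common BERT convention ([PAD]=0, [CLS]=101, [SEP]=102); all other ids are toy values, not a real vocabulary:

```python
def encode(tokens, max_len=10):
    # Wrap the sequence in [CLS]/[SEP], pad to a fixed length, and
    # produce the companion attention mask and segment ids.
    vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102}
    seq = ["[CLS]"] + list(tokens) + ["[SEP]"]
    n_pad = max_len - len(seq)
    seq += ["[PAD]"] * n_pad
    ids = [vocab.setdefault(t, 1000 + len(vocab)) for t in seq]  # toy ids
    attention_mask = [1] * (max_len - n_pad) + [0] * n_pad       # 0 = ignore [PAD]
    segment_ids = [0] * max_len                                  # single sequence: one segment
    return ids, attention_mask, segment_ids

ids, mask, segs = encode(["pt", "d/c", "home", "ama"])
```

The attention mask tells the model to ignore [PAD] positions, and a single segment id suffices because each training example is one sentence, not a pair.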
Fig. 3 The architecture of the proposed model.
Accuracy results for the University of Minnesota (UMN) data set
| Pretrained model | Training accuracy (%) | Validation accuracy (%) | Test accuracy (%) |
|---|---|---|---|
| Bio_Clinical | 98.85 | 98.97 | 98.99 |
| BlueBERT | 98.46 | 98.73 | 98.75 |
| MS_BERT | 98.98 | 99.11 | 99.13 |
Note: Accuracies differ only slightly across the three pretrained models.
Accuracy results for the University of Minnesota (UMN) data set compared with several previous works
| Methods | Accuracy (%) |
|---|---|
| Multiclassifier | |
| Convolutional Neural Network (CNN) | 77.83 |
| CLASS-GATOR | 76.92 |
| One-fits-all classifier | |
| ELMo + Topic | 70.41 |
| Latent Meaning Cells (LMC) | 71.00 |
| Candidate Classification | 98.39 |
| MS_BERT (our approach) | 99.13 |
Note: Our model achieves state-of-the-art accuracy with the MS_BERT pretrained model.