Oswaldo Solarte Pabón, Orlando Montenegro, Maria Torrente, Alejandro Rodríguez González, Mariano Provencio, Ernestina Menasalvas.
Abstract
Detecting negation and uncertainty is crucial for medical text mining applications; otherwise, extracted information can be incorrectly presented as real or factual events. Although several approaches have been proposed to detect negation and uncertainty in clinical texts, most efforts have focused on the English language. Most proposals developed for Spanish focus mainly on negation detection and do not deal with uncertainty. In this paper, we propose a deep learning-based approach for both negation and uncertainty detection in clinical texts written in Spanish. The proposed approach explores two deep learning methods to achieve this goal: (i) Bidirectional Long Short-Term Memory with a Conditional Random Field layer (BiLSTM-CRF) and (ii) Bidirectional Encoder Representations from Transformers (BERT). The approach was evaluated using NUBES and IULA, two public corpora for the Spanish language. The results show an F-score of 92% and 80% in the scope recognition task for negation and uncertainty, respectively. We also present the results of a validation process conducted on a real-life annotated dataset of clinical notes from cancer patients. The proposed approach shows the feasibility of deep learning-based methods for detecting negation and uncertainty in Spanish clinical texts. Experiments also show that this approach improves performance in the scope recognition task compared to other proposals in the biomedical domain. © 2022 Solarte Pabón et al.
Keywords: Clinical texts; Deep learning; Natural Language Processing; Negation and Uncertainty detection; Text mining
Year: 2022 PMID: 35494817 PMCID: PMC9044225 DOI: 10.7717/peerj-cs.913
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Summary of Deep learning-based approaches.
| Reference | Model | Language | Corpus | Negation | Uncertainty |
|---|---|---|---|---|---|
| Qian2016 | CNN | English | BioScope | yes | yes |
| Fancellu2017 | BiLSTM | English & Chinese | BioScope & CNeSp | yes | no |
| Taylor2018 | BiLSTM | English | BioScope | yes | no |
| Uzuner2011 | Encoder-Decoder | English | i2b2/VA NLP challenge | no | yes |
| Attention2017 | RNN + Attention | English | BioScope | yes | yes |
| Attention_Chen_2019 | RNN + Attention | English | i2b2/VA NLP challenge | yes | no |
| NegBERT2020 | BERT | English | BioScope | yes | no |
| Dalloux2019 | BiLSTM | French | NCI - France | yes | yes |
| Al-khawaldeh2019 | Attention + BiLSTM | Arabic | Bio Arabic | no | yes |
| Santiso2018 | Embeddings + CRF | Spanish | IULA | yes | no |
| Santiso2020 | BiLSTM | Spanish | IxaMed-GS | yes | no |
| Zavala2020 | BiLSTM, BERT | Spanish | IULA | yes | no |
| Lima2020 | BiLSTM | Spanish | NUBES | yes | yes |
Figure 1. Sentences with negation and uncertainty cues and their scopes.
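The cue-and-scope annotations illustrated in Figure 1 can be sketched as BIO token labels. This is a minimal illustration, not the paper's actual code; the example sentence is hypothetical, and the label names (B-NEG/I-NEG for negation cues, B-NSCO/I-NSCO for negation scopes, with UNC/USCO as the uncertainty counterparts) follow the label set reported in the results tables below.

```python
# Hedged sketch: BIO encoding of a Spanish clinical sentence with a
# negation cue ("no") and its scope ("refiere dolor toracico").
# The sentence and span indices are illustrative.

def encode_bio(tokens, cue_span, scope_span, cue_tag, scope_tag):
    """Assign BIO labels: B-/I- inside the given half-open spans, O elsewhere."""
    labels = ["O"] * len(tokens)
    for (start, end), tag in ((cue_span, cue_tag), (scope_span, scope_tag)):
        for i in range(start, end):
            labels[i] = ("B-" if i == start else "I-") + tag
    return labels

tokens = ["El", "paciente", "no", "refiere", "dolor", "toracico", "."]
labels = encode_bio(tokens, cue_span=(2, 3), scope_span=(3, 6),
                    cue_tag="NEG", scope_tag="NSCO")
print(list(zip(tokens, labels)))
# → [('El', 'O'), ('paciente', 'O'), ('no', 'B-NEG'), ('refiere', 'B-NSCO'),
#    ('dolor', 'I-NSCO'), ('toracico', 'I-NSCO'), ('.', 'O')]
```

Note that a cue and its scope are labeled independently, which also accommodates the discontinuous scopes counted in the corpus statistics below.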
Inter-annotator agreement for the Cancer dataset.
| | | |
|---|---|---|
| Negation cue | 0.96 | 0.95 |
| Negation scope | 0.94 | 0.92 |
| Uncertainty cue | 0.94 | 0.93 |
| Uncertainty scope | 0.92 | 0.90 |
A summary of the datasets used in the proposed approach.
| | NUBES | IULA | Cancer |
|---|---|---|---|
| Number of sentences | 29,682 | 3,194 | 2,700 |
| Sentences with negation | 25.5% | 34% | 27% |
| Sentences with uncertainty | 7.5% | – | 12% |
| Maximum number of tokens | 210 | 159 | 181 |
| Mean (Number of tokens) | 18 | 14 | 15 |
| Median (Number of tokens) | 14 | 10 | 12 |
| First quartile | 9 | 6 | 7 |
| Third quartile | 23 | 19 | 18 |
A summary of cues and scopes in the datasets.
| | NUBES | IULA | Cancer |
|---|---|---|---|
| Number of distinct negation cues | 345 | 46 | 52 |
| Number of distinct uncertainty cues | 303 | – | 70 |
| Total negation cues | 9318 | 1145 | 804 |
| Total uncertainty cues | 2529 | – | 345 |
| Syntactic negation cues | 85% | 92% | 83% |
| Lexical negation cues | 6% | 8% | 10% |
| Morphological negation cues | 9% | 7% | 2% |
| Syntactic uncertainty cues | 2% | – | 1% |
| Lexical uncertainty cues | 98% | – | 99% |
| Continuous scopes | 95% | 96% | 97% |
| Discontinuous scopes | 5% | 4% | 3% |
Figure 2. Pre-processing steps to transform an annotated corpus into matrices.
Figure 3. Negation and uncertainty detection using the BiLSTM-CRF model.
Figure 4. Negation and uncertainty detection using multilingual BERT.
Results for each BIO label using the NUBES corpus (BiLSTM-CRF + Biomedical Embeddings).
| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| -PAD- | 1.0 | 1.0 | 1.0 | 474126 |
| B-NEG | 0.95 | 0.93 | 0.94 | 1423 |
| B-NSCO | 0.93 | 0.91 | 0.92 | 1322 |
| B-UNC | 0.86 | 0.84 | 0.85 | 400 |
| B-USCO | 0.84 | 0.79 | 0.81 | 400 |
| I-NEG | 0.89 | 0.86 | 0.87 | 120 |
| I-NSCO | 0.92 | 0.88 | 0.90 | 3901 |
| I-UNC | 0.85 | 0.75 | 0.80 | 168 |
| I-USCO | 0.82 | 0.77 | 0.79 | 1513 |
| O | 0.98 | 0.99 | 0.98 | 41977 |
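The per-label F1-scores above are the harmonic mean of precision and recall, which can be checked directly against the table rows:

```python
# F1 as the harmonic mean of precision and recall; e.g. the B-NEG row
# (P = 0.95, R = 0.93) yields F1 ≈ 0.94, matching the table.

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.95, 0.93), 2))  # → 0.94
```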
Results for cue identification (Partial match).
| | Negation | | | Uncertainty | | | Negation & Uncertainty | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 |
| BiLSTM-CRF | 0.86 | 0.82 | 0.83 | 0.79 | 0.76 | 0.77 | 0.82 | 0.78 | 0.80 |
| BiLSTM-CRF + Biomedical Embeddings | 0.94 | 0.92 | 0.93 | 0.85 | 0.81 | 0.83 | 0.91 | 0.90 | 0.90 |
| BiLSTM-CRF + Clinical Embeddings | 0.93 | 0.91 | 0.92 | 0.84 | 0.80 | 0.82 | 0.90 | 0.88 | 0.89 |
| Multilingual BERT | 0.95 | 0.93 | | 0.86 | 0.83 | | 0.92 | 0.93 | |
Results for scope recognition (Partial match).
| | Negation | | | Uncertainty | | | Negation & Uncertainty | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 |
| BiLSTM-CRF | 0.84 | 0.76 | 0.79 | 0.72 | 0.69 | 0.70 | 0.77 | 0.74 | 0.75 |
| BiLSTM-CRF + Biomedical Embeddings | 0.92 | 0.89 | 0.90 | 0.82 | 0.77 | 0.79 | 0.88 | 0.84 | 0.86 |
| BiLSTM-CRF + Clinical Embeddings | 0.92 | 0.87 | 0.89 | 0.81 | 0.75 | 0.78 | 0.87 | 0.83 | 0.85 |
| Multilingual BERT | 0.93 | 0.90 | | 0.82 | 0.79 | | 0.91 | 0.86 | |
Results for scope recognition (Exact match).
| | Negation | | | Uncertainty | | | Negation & Uncertainty | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 |
| BiLSTM-CRF | 0.81 | 0.73 | 0.76 | 0.67 | 0.63 | 0.64 | 0.74 | 0.72 | 0.73 |
| BiLSTM-CRF + Biomedical Embeddings | 0.88 | 0.85 | 0.86 | 0.73 | 0.70 | 0.71 | 0.84 | 0.83 | 0.84 |
| BiLSTM-CRF + Clinical Embeddings | 0.86 | 0.84 | 0.84 | 0.71 | 0.68 | 0.69 | 0.84 | 0.80 | 0.82 |
| Multilingual BERT | 0.90 | 0.86 | | 0.75 | 0.70 | | 0.89 | 0.84 | |
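The partial- and exact-match criteria used in the two preceding tables can be sketched as span comparisons. This is a minimal illustration under the usual reading of these terms (the paper's evaluation scripts may differ in detail): an exact match requires the predicted scope to cover exactly the gold tokens, while a partial match only requires overlap.

```python
# Hedged sketch of the two scope-matching criteria, with spans as
# half-open (start, end) token index pairs.

def exact_match(gold, pred):
    """Exact match: gold and predicted spans are identical."""
    return gold == pred

def partial_match(gold, pred):
    """Partial match: gold and predicted spans overlap in at least one token."""
    g_start, g_end = gold
    p_start, p_end = pred
    return max(g_start, p_start) < min(g_end, p_end)

gold = (3, 6)                           # gold scope covers tokens 3..5
assert exact_match(gold, (3, 6))
assert partial_match(gold, (4, 6))      # overlaps but is not identical
assert not exact_match(gold, (4, 6))
assert not partial_match(gold, (6, 8))  # disjoint spans never match
```

This distinction explains why every system scores lower under exact match than under partial match in the tables above.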
Validation results with the Cancer dataset (Partial match).
| | Negation cue | | | Uncertainty cue | | | Negation scope | | | Uncertainty scope | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| BiLSTM-CRF + Biomedical Embeddings | 0.89 | 0.87 | 0.88 | 0.75 | 0.80 | 0.78 | 0.86 | 0.83 | 0.84 | 0.79 | 0.74 | 0.76 |
| BiLSTM-CRF + Clinical Embeddings | 0.91 | 0.88 | 0.89 | 0.80 | 0.81 | 0.80 | 0.87 | 0.85 | 0.86 | 0.79 | 0.76 | 0.77 |
| Multilingual BERT | 0.91 | 0.89 | | 0.84 | 0.80 | | 0.89 | 0.86 | | 0.79 | 0.78 | |
Validation results with the Cancer dataset (Exact match).
| | Negation cue | | | Uncertainty cue | | | Negation scope | | | Uncertainty scope | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 | P | R | F1 |
| BiLSTM-CRF + Biomedical Embeddings | 0.87 | 0.86 | 0.87 | 0.75 | 0.80 | 0.78 | 0.81 | 0.80 | 0.80 | 0.71 | 0.69 | 0.70 |
| BiLSTM-CRF + Clinical Embeddings | 0.91 | 0.88 | 0.89 | 0.80 | 0.81 | 0.80 | 0.82 | 0.83 | 0.75 | 0.72 | 0.71 | 0.71 |
| Multilingual BERT | 0.91 | 0.89 | | 0.82 | 0.80 | | 0.85 | 0.84 | | 0.75 | 0.73 | |
Comparison with other proposals in the cue identification task.
| | Negation | | | Uncertainty | | |
|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 |
| | 0.96 | 0.95 | | 0.87 | 0.83 | |
| Our approach (BiLSTM-based) | 0.94 | 0.92 | 0.93 | 0.85 | 0.81 | 0.83 |
| Our approach (BERT-based) | 0.95 | 0.95 | | 0.86 | 0.83 | 0.84 |
Comparison with other proposals in the scope recognition task (F-score).
| | NUBES (Negation) | NUBES (Uncertainty) | IULA (Negation) |
|---|---|---|---|
| | – | – | 0.83 |
| | – | – | 0.85 |
| | 0.90 | 0.78 | – |
| Our approach (BiLSTM-based) | 0.90 | 0.79 | 0.84 |
| Our approach (BERT-based) | | | |