| Literature DB >> 33919018 |
Daniel Bravo-Candel ¹, Jésica López-Hernández ¹, José Antonio García-Díaz ¹, Fernando Molina-Molina ², Francisco García-Sánchez ¹.
Abstract
Real-word errors are characterized by being actual terms in the dictionary; they can only be detected by taking the surrounding context into account. Traditional methods for detecting and correcting such errors mostly count the frequency of short word sequences in a corpus and then compute the probability of a word being a real-word error. State-of-the-art approaches, in contrast, use deep learning models that learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a seq2seq Neural Machine Translation model mapped erroneous sentences to their corrected versions. To obtain training data, different types of errors were generated in correct sentences by applying rules. Several seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus because of the privacy issues involved in handling patient information. Moreover, GloVe and Word2Vec pretrained word embeddings were used to study their effect on performance. Despite the medicine corpus being much smaller than the Wikicorpus, the seq2seq models trained on it outperformed those trained on the Wikicorpus. Nevertheless, a larger amount of clinical text is required to further improve the results.
Keywords: clinical texts; error correction; natural language processing; real-word error; seq2seq neural machine translation model; word embeddings
Year: 2021 PMID: 33919018 PMCID: PMC8122440 DOI: 10.3390/s21092893
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Corpora and methods used in each study.
| Study | Corpora | Methods | Language |
|---|---|---|---|
| [ | Corpus of clinical records of the Emergency Department at the Concord Hospital, Sydney. | Rule-based system and frequency-based method for detection and correction, respectively. | English |
| [ | Brigham Young University bigram and trigram corpus. | Correction frequency-based method: bigram and trigram counting. | English |
| [ | Free-text allergy entries from Allergy Repository (PEAR). | Dictionary look-up detection method and noisy channel correction method. | English |
| [ | Clinical free-text from the Medical Information Mart for Intensive Care III (MIMIC-III) database. | No detection method. Word embedding cosine similarity for correction. | English |
| [ | Sentences extracted from Basque news websites. | Correction with seq2seq Neural Machine Translation model. | Basque |
| [ | Sentence pairs from 1B Word Benchmark. | BERT, RoBERTa and ELMo encoders and feature/fine-tuning based training for error location. | English |
| [ | Bangor Dyslexic Arabic Corpus (BDAC). | Dictionary look-up detection method. PPM language model, edit operations and compression codelength for correction. | Arabic |
| [ | | Error detection based on the frequency of n-grams. Affinity, lexical context and context feature metrics to select candidate correction. | Spanish |
Figure 1. Block diagram of the seq2seq-based approach to clinical text correction.
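The pipeline in the block diagram feeds parallel (erroneous, correct) sentence pairs to an encoder-decoder. A minimal sketch of the kind of preprocessing such a seq2seq pipeline needs; the names (`SOS`, `EOS`, `build_vocab`, `encode`) are illustrative, not taken from the paper:

```python
# Minimal sketch of preparing parallel (erroneous -> correct) sentence
# pairs for a seq2seq model, assuming word-level tokenization as in
# typical NMT pipelines.

SOS, EOS, UNK = "<sos>", "<eos>", "<unk>"

def build_vocab(sentences):
    """Map each word to an integer id, reserving ids for special tokens."""
    vocab = {SOS: 0, EOS: 1, UNK: 2}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into the id sequence fed to the encoder/decoder."""
    ids = [vocab.get(w, vocab[UNK]) for w in sentence.split()]
    return [vocab[SOS]] + ids + [vocab[EOS]]

# A synthetic pair: the source carries a rule-generated genre error
# ("las" instead of "los"); the target is the correct sentence.
source = "se revisaron las informes clínicos"
target = "se revisaron los informes clínicos"
vocab = build_vocab([source, target])
src_ids, tgt_ids = encode(source, vocab), encode(target, vocab)
```

The encoder consumes `src_ids` and the decoder is trained to emit `tgt_ids`; only the function-word id differs between the two sequences.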
Size of the Spanish medicine datasets.
| Dataset | Cases | Sentences | Tokens |
|---|---|---|---|
| CodiEsp | 3751 | 50,052 | 1,233,201 |
| MEDDOCAN | 1000 | 33,000 | 495,000 |
| SPACCC | 20,900 | 20,900 | 436,000 |
Different error types used. Errors were generated by replacing word substrings (in Spanish).
| Error Type | Erroneous Sentence (Fragment) |
|---|---|
| Grammatical genre | Dicho traspaso se realiza … |
| | Las aguas del oeste son … |
| Number | La … |
| | Consciente de sus numerosas … |
| Grammatical genre-number | Se caracterizan por tener diseños agresivos y … |
| | El proyecto fue … |
| Homophones | Es una de las compañías más importantes … |
| | Receta de pollo con lima y … |
| Hunspell | Más de la mitad de la gente vive ahora en la … |
| | El … |
| Subject-verb concordance | Tanto las mujeres como los hombres las … |
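The substring-replacement idea behind the table can be sketched as follows; the rule dictionaries here (`GENRE_RULES`, `NUMBER_RULES`) are illustrative toys over Spanish function words, not the rule set used by the authors:

```python
# Illustrative sketch of rule-based real-word error generation. Each rule
# maps a correct word to a same-dictionary ("real-word") variant, so the
# result still passes a plain spell check.

GENRE_RULES = {"el": "la", "la": "el", "los": "las", "las": "los"}
NUMBER_RULES = {"el": "los", "los": "el"}

def inject_error(sentence, rules):
    """Replace the first rule-matching token, producing an erroneous
    sentence to pair with the original correct one for training."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok in rules:
            tokens[i] = rules[tok]
            return " ".join(tokens)
    return None  # no rule applies; the sentence is discarded

pair = ("la paciente ingresa estable",
        inject_error("la paciente ingresa estable", GENRE_RULES))
```

Sentences where no rule fires return `None` and are skipped, which is why each error type yields a different number of generated sentences in the tables below.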
Number of sentences for each error type for the Wikicorpus datasets. Error types: genre (E1), number (E2), genre-number (E3), homophone (E4), Hunspell-generated (E5) and subject-verb concordance (E6).
| Wikicorpus | E1 | E2 | E3 | E4 | E5 | E6 |
|---|---|---|---|---|---|---|
| Strategy 1 | 1,952,506 | 3,196,193 | 2,027,014 | 1,712,850 | 2,753,397 | 1,786,375 |
| Strategy 2 | 472,376 | 788,092 | 492,374 | 415,709 | 671,898 | 430,993 |
Number of sentences for each error type for the medicine datasets. Error types: genre (E1), number (E2), genre-number (E3), homophone (E4), Hunspell-generated (E5) and subject-verb concordance (E6).
| Medicine | E1 | E2 | E3 | E4 | E5 | E6 |
|---|---|---|---|---|---|---|
| Strategy 1 | 64,439 | 100,871 | 71,027 | 27,609 | 90,917 | 67,849 |
| Strategy 2 | 18,537 | 24,261 | 19,704 | 6000 | 20,004 | 18,852 |
Evaluation datasets statistics.
| Corpus | Strategy | Training | Validation * | Test | Total |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | 14,181,497 | 5000 | 10,000 | 14,196,497 |
| | Strategy 2 | 3,512,147 | 5000 | 10,000 | 3,527,147 |
| Medicine corpus | Strategy 1 | 276,835 | 5000 | 9533 | 291,368 |
| | Strategy 2 | 75,062 | 5000 | 9533 | 89,595 |
| Medicine-wikicorpus | Strategy 1 | 553,670 | 5000 | 10,000 | 568,670 |
| | Strategy 2 | 150,124 | 5000 | 10,000 | 165,124 |
* The validation dataset is used during training to avoid the model overfitting.
Sample successful and failed corrections (in Spanish).
| Error Type | Result | Sentence (Source Fragment) |
|---|---|---|
| Grammatical genre | Success | Source: “Se procede a la electrocoagulación de varias áreas con sangrado …” |
| | Failure | Source: “Estabilización y …” |
| Number | Success | Source: “No obstante el …” |
| | Failure | Source: “Sin …” |
| Grammatical genre-number | Success | Source: “Los …” |
| | Failure | Source: “La movilidad facial estaba …” |
| Homophones | Success | Source: “Durante un año de seguimiento …” |
| | Failure | Source: “Tampoco refiere el …” |
| Hunspell | Success | Source: “En la observación …” |
| | Failure | Source: “Tras la administración de contraste la …” |
| Subject-verb concordance | Success | Source: “Se …” |
| | Failure | Source: “Se …” |
Recall (R), precision (P) and F0.5 measure for the datasets generated according to the strategies listed in Section 3.3.3 (‘Strategy 1’ and ‘Strategy 2’). The results are shown for the seq2seq models with GloVe and Word2Vec pretrained embeddings and with no pretrained embeddings at all (-). All error types were considered.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 37.61 | 70.19 | 59.77 |
| | | GloVe | 12.79 | 21.71 | 19.05 |
| | | W2V | 18.88 | 39.72 | 32.54 |
| | Strategy 2 | - | 23.58 | 46.04 | 38.67 |
| | | GloVe | 21.16 | 41.65 | 34.89 |
| | | W2V | 23.84 | 48.42 | 40.15 |
| Medicine corpus | Strategy 1 | - | 46.06 | 65.68 | 60.53 |
| | | GloVe | 15.79 | 17.95 | 17.48 |
| | | W2V | 40.07 | 54.31 | 50.71 |
| | Strategy 2 | - | 50.92 | 69.79 | 64.98 |
| | | GloVe | 21.58 | 24.71 | 24.01 |
| | | W2V | 46.69 | 61.82 | 58.08 |
| Medicine-wikicorpus | Strategy 1 | - | 20.61 | 45.83 | 36.82 |
| | | GloVe | 3.76 | 4.61 | 4.41 |
| | | W2V | 14.88 | 24.79 | 21.88 |
| | Strategy 2 | - | 27.92 | 56.33 | 46.81 |
| | | GloVe | 10.63 | 13.36 | 12.70 |
| | | W2V | 23.36 | 40.92 | 35.57 |
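The third metric column is consistent with the F-beta score at beta = 0.5, which weights precision more heavily than recall (e.g. R = 30.52, P = 86.35 gives 63.22). A minimal sketch under that assumption, taking the reported percentages directly as inputs:

```python
# F-beta score with beta = 0.5. With beta < 1, precision dominates the
# harmonic combination, so the score lands between R and P but closer to P,
# matching the pattern in the results tables.

def f_beta(recall, precision, beta=0.5):
    """F-beta score in the same percentage units as the inputs."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

score = f_beta(30.52, 86.35)  # genre errors, Wikicorpus, Strategy 1
```

Small discrepancies (a few hundredths) can appear when recomputing from the rounded R and P values in the tables.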
Recall (R), precision (P) and F0.5 for the evaluation of genre error correction.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 30.52 | 86.35 | 63.22 |
| | | GloVe | 13.92 | 30.20 | 24.47 |
| | | W2V | 21.53 | 65.93 | 46.68 |
| | Strategy 2 | - | 32.16 | 83.14 | 63.13 |
| | | GloVe | 25.09 | 53.56 | 43.66 |
| | | W2V | 30.57 | 74.60 | 57.92 |
| Medicine corpus | Strategy 1 | - | 36.71 | 81.70 | 65.62 |
| | | GloVe | 11.51 | 15.12 | 14.22 |
| | | W2V | 30.85 | 61.19 | 51.13 |
| | Strategy 2 | - | 48.10 | 84.83 | 73.59 |
| | | GloVe | 21.48 | 27.74 | 26.21 |
| | | W2V | 45.75 | 78.99 | 68.97 |
Recall (R), precision (P) and F0.5 for the evaluation of number error correction.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 43.01 | 87.81 | 72.67 |
| | | GloVe | 27.12 | 45.53 | 40.09 |
| | | W2V | 37.97 | 77.00 | 63.87 |
| | Strategy 2 | - | 49.58 | 84.18 | 73.87 |
| | | GloVe | 39.45 | 62.66 | 56.06 |
| | | W2V | 48.93 | 81.47 | 71.91 |
| Medicine corpus | Strategy 1 | - | 60.32 | 83.03 | 77.22 |
| | | GloVe | 22.35 | 27.05 | 25.96 |
| | | W2V | 55.67 | 74.54 | 69.81 |
| | Strategy 2 | - | 72.11 | 86.01 | 82.82 |
| | | GloVe | 28.24 | 31.42 | 30.73 |
| | | W2V | 66.57 | 79.15 | 76.27 |
Recall (R), precision (P) and F0.5 for the evaluation of genre-number error correction.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 35.07 | 82.37 | 64.87 |
| | | GloVe | 19.45 | 35.53 | 30.49 |
| | | W2V | 30.19 | 67.61 | 54.18 |
| | Strategy 2 | - | 39.61 | 78.84 | 65.81 |
| | | GloVe | 29.97 | 52.69 | 45.75 |
| | | W2V | 38.14 | 71.60 | 60.91 |
| Medicine corpus | Strategy 1 | - | 53.40 | 78.57 | 71.80 |
| | | GloVe | 16.40 | 18.63 | 18.13 |
| | | W2V | 50.74 | 70.71 | 65.55 |
| | Strategy 2 | - | 70.57 | 84.90 | 81.59 |
| | | GloVe | 27.64 | 30.48 | 29.87 |
| | | W2V | 64.94 | 77.74 | 74.80 |
Recall (R), precision (P) and F0.5 for the evaluation of homophone error correction.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 78.41 | 93.34 | 89.92 |
| | | GloVe | 8.77 | 19.63 | 15.73 |
| | | W2V | 13.20 | 51.16 | 32.48 |
| | Strategy 2 | - | 17.20 | 66.52 | 42.28 |
| | | GloVe | 12.98 | 33.62 | 25.51 |
| | | W2V | 30.57 | 74.59 | 57.92 |
| Medicine corpus | Strategy 1 | - | 72.24 | 88.76 | 84.88 |
| | | GloVe | 26.11 | 26.65 | 26.54 |
| | | W2V | 70.75 | 76.50 | 75.28 |
| | Strategy 2 | - | 75.68 | 83.30 | 81.66 |
| | | GloVe | 28.85 | 29.68 | 29.51 |
| | | W2V | 69.66 | 75.63 | 74.34 |
Recall (R), precision (P) and F0.5 for the evaluation of Hunspell-generated error correction.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 17.42 | 78.06 | 46.03 |
| | | GloVe | 15.49 | 40.96 | 30.82 |
| | | W2V | 20.27 | 58.74 | 42.58 |
| | Strategy 2 | - | 10.25 | 56.60 | 29.72 |
| | | GloVe | 7.17 | 28.00 | 17.71 |
| | | W2V | 7.52 | 31.88 | 19.34 |
| Medicine corpus | Strategy 1 | - | 36.10 | 75.40 | 61.92 |
| | | GloVe | 13.31 | 16.36 | 15.64 |
| | | W2V | 30.08 | 56.13 | 47.85 |
| | Strategy 2 | - | 20.60 | 54.57 | 41.04 |
| | | GloVe | 9.69 | 12.95 | 12.13 |
| | | W2V | 20.65 | 44.46 | 36.13 |
Recall (R), precision (P) and F0.5 for the evaluation of subject-verb concordance error correction.
| Corpus | Training Strategy | Embedding | R | P | F0.5 |
|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | - | 24.55 | 50.00 | 41.41 |
| | | GloVe | 7.89 | 16.55 | 13.57 |
| | | W2V | 10.96 | 33.95 | 23.92 |
| | Strategy 2 | - | 15.23 | 44.48 | 32.14 |
| | | GloVe | 11.89 | 32.58 | 24.17 |
| | | W2V | 14.24 | 38.46 | 28.70 |
| Medicine corpus | Strategy 1 | - | 44.55 | 59.30 | 55.61 |
| | | GloVe | 16.65 | 18.69 | 18.25 |
| | | W2V | 41.37 | 52.07 | 49.51 |
| | Strategy 2 | - | 47.39 | 62.18 | 58.53 |
| | | GloVe | 19.72 | 22.74 | 22.06 |
| | | W2V | 43.01 | 55.59 | 52.52 |
Comparison of F0.5 measures before and after removing the subject-verb concordance error type and selecting Hunspell-generated errors of Levenshtein distance equal to two. No pretrained embedding was used. (E1) genre, (E2) number, (E3) genre-number, (E4) homophones and (E5) Hunspell-generated errors.
| Corpus | Training Strategy | | E1 | E2 | E3 | E4 | E5 |
|---|---|---|---|---|---|---|---|
| Wikicorpus | Strategy 1 | Before | 63.22 | 72.67 | 64.87 | 89.92 | 46.03 |
| | | After | 54.41 | 65.30 | 50.81 | 86.91 | 78.92 |
| | Strategy 2 | Before | 63.13 | 73.83 | 65.81 | 42.28 | 29.72 |
| | | After | 51.29 | 67.36 | 49.68 | 73.74 | 70.99 |
| Medicine corpus | Strategy 1 | Before | 65.62 | 77.22 | 71.80 | 84.88 | 61.92 |
| | | After | 50.02 | 71.96 | 62.14 | 77.09 | 77.53 |
| | Strategy 2 | Before | 73.59 | 82.82 | 81.59 | 81.66 | 41.04 |
| | | After | 56.52 | 74.78 | 67.16 | 72.44 | 74.42 |
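The Levenshtein-distance filter mentioned in the caption above (keeping only Hunspell-generated confusions at distance two from the original word) can be sketched with the standard dynamic-programming recurrence; `keep_confusion` is an illustrative helper name:

```python
# Standard Levenshtein distance via a rolling one-row DP table, used here
# to filter candidate Hunspell confusions by edit distance.

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    turning string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def keep_confusion(original, suggestion, distance=2):
    """Keep only suggestions exactly `distance` edits from the original."""
    return levenshtein(original, suggestion) == distance
```

For example, "casa" -> "caso" is distance 1 and would be discarded, while "casa" -> "cosas" is distance 2 and would be kept.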