Joyce Kam, Lucia Yin, Somain Verma, Julia Ive, Natalia Viani, Stephen Puntis, Rudolf N Cardinal, Angus Roberts, Robert Stewart, Sumithra Velupillai.
Abstract
A serious obstacle to the development of Natural Language Processing (NLP) methods in the clinical domain is the accessibility of textual data. The mental health domain is particularly challenging, partly because clinical documentation relies heavily on free text that is difficult to de-identify completely. This problem could be tackled by using artificial medical data. In this work, we present an approach to generate artificial clinical documents. We apply this approach to discharge summaries from a large mental healthcare provider and discharge summaries from an intensive care unit. We perform an extensive intrinsic evaluation where we (1) apply several measures of text preservation; (2) measure how much the model memorises training data; and (3) estimate clinical validity of the generated text based on a human evaluation task. Furthermore, we perform an extrinsic evaluation by studying the impact of using artificial text in a downstream NLP text classification task. We found that using this artificial data as training data can lead to classification results that are comparable to the original results. Additionally, using only a small amount of information from the original data to condition the generation of the artificial data is successful, which holds promise for reducing the risk of these artificial data retaining rare information from the original data. This is an important finding for our long-term goal of being able to generate artificial clinical data that can be released to the wider research community and accelerate advances in developing computational methods that use healthcare data.
Keywords: Medical research; Scientific community
Year: 2020 PMID: 32435697 PMCID: PMC7224173 DOI: 10.1038/s41746-020-0267-x
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1 Overview of the text generation procedure.
Key phrases are extracted from paragraphs in the original data (genuine paragraphs) and combined with clinical information (ICD-10 diagnosis code, gender, and age). These inputs condition our text generation model, which produces an artificial paragraph.
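The conditioning step described above can be sketched as follows. The key-phrase heuristic, the special-token prompt format, and all function and field names here are illustrative assumptions, not the paper's actual pipeline:

```python
# Sketch of the Fig. 1 conditioning step: key phrases plus clinical
# metadata are packed into one input string for the generator.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "was", "is", "he", "she", "in", "on"}

def extract_key_phrases(paragraph: str, top_k: int = 5) -> list[str]:
    """Naive stand-in for the paper's key-phrase extraction:
    keep the most frequent non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

def build_generator_input(paragraph: str, icd10: str, gender: str, age: int) -> str:
    """Combine key phrases with ICD-10 code, gender, and age into a
    single conditioning string for the text-generation model."""
    phrases = extract_key_phrases(paragraph)
    return f"<icd10={icd10}> <gender={gender}> <age={age}> " + " ".join(phrases)
```

The metadata tokens let a single model be conditioned on varying amounts of source information, which is the knob the paper varies across its experimental settings.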
Fig. 2 Overview of the extrinsic evaluation procedure.
An NLP model is built on (1) genuine data and (2) artificial data. Both models are tested on genuine test data. Comparing these results indicates how useful artificial data is for NLP model development.
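This train-on-one-source, test-on-genuine comparison can be sketched with a simple bag-of-words classifier; the toy documents, labels, and classifier choice below are invented assumptions (the paper's actual models are BoW, LDA, and CNN classifiers):

```python
# Sketch of the extrinsic evaluation in Fig. 2: two models, one trained
# on genuine and one on artificial text, both scored on genuine test data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def evaluate(train_texts, train_labels, test_texts, test_labels):
    """Train on one data source; always test on genuine test data."""
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    return f1_score(test_labels, model.predict(test_texts), average="macro")

# Toy stand-ins for the genuine and artificial training sets.
genuine_train = ["hears voices and expresses paranoid ideas",
                 "reports low mood and poor sleep",
                 "describes persistent auditory hallucinations",
                 "feels hopeless with reduced appetite"]
artificial_train = ["voices heard and paranoid ideas expressed",
                    "low mood with poor sleep reported",
                    "persistent auditory hallucinations described",
                    "hopeless with appetite reduced"]
labels = ["F20", "F32", "F20", "F32"]
genuine_test = ["paranoid and hearing voices", "mood low and sleep poor"]
test_labels = ["F20", "F32"]

f1_genuine = evaluate(genuine_train, labels, genuine_test, test_labels)
f1_artificial = evaluate(artificial_train, labels, genuine_test, test_labels)
```

The closer `f1_artificial` tracks `f1_genuine`, the better the artificial data substitutes for the original in model development.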
Mental health diagnoses from discharge summaries in the CRIS database. We report frequency for test-gen-mhr.
| ICD-10 | Description | Freq (%) |
|---|---|---|
| F20 | Schizophrenia | 29 |
| F32 | Major depressive disorder, single episode | 21 |
| F60 | Specific personality disorders | 16 |
| F31 | Bipolar affective disorder | 14 |
| F25 | Schizoaffective disorders | 11 |
| F10 | Mental and behavioural disorders due to use of alcohol | 9 |
Qualitative evaluation and average sentence lengths on the CRIS data (test-gen-mhr) and the MIMIC-III data (test-gen-mimic). Models providing data closest to the original data according to all the scores are highlighted in bold.
| Model | PPL | ROUGE-L↑ | BLEU↑ | TER↓ | Avg. sentence length |
|---|---|---|---|---|---|
| CRIS (test-gen-mhr) | | | | | |
| genuine | − | − | − | − | 22.44 |
| | 7.24 | 0.76 | 40.88 | 0.39 | 17.84 |
| | 37.46 | 0.40 | 10.29 | 0.80 | 10.63 |
| | − | 0.58 | 7.75 | 0.56 | 10.21 |
| MIMIC-III (test-gen-mimic) | | | | | |
| genuine | − | − | − | − | 17.55 |
| | 3.22 | 0.81 | 53.45 | 0.31 | 14.5 |
| | 9.75 | 0.47 | 16.66 | 0.74 | 9.72 |
| | − | 0.59 | 8.70 | 0.56 | 7.94 |
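The surface-overlap metrics in the table are standard reference-based measures. As one concrete example, sentence-level ROUGE-L is an F-measure over the longest common subsequence (LCS) between a genuine and a generated sentence; the sketch below is a simplified assumption of how it is computed (published scores typically come from standard toolkits, e.g. sacrebleu for BLEU and TER):

```python
# Sentence-level ROUGE-L as an LCS-based F-measure.
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Higher ROUGE-L and BLEU mean the generated text stays closer to the genuine text, while lower TER (edit rate) means fewer edits are needed to recover it.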
Fig. 3 Cumulative distributions (CDFs) of the TER bins for the key, all, top+meta, and one+meta sentences for test-gen-mhr.
The x-axis plots TER bins; the y-axis plots the respective cumulative frequencies of the test-gen-mhr sentences.
Memorisation assessment for 1K samples per n-gram group in the top+meta train-gen-mhr model. High denotes n-grams from the upper frequency quartile and Low n-grams from the lower frequency quartile; %, In is the percentage of target n-grams found in the input key phrases, and %, Out the percentage found in the respective generated output. Highest PPL values are highlighted in bold.
| | 2-gram High | 2-gram Low | 3-gram High | 3-gram Low | 5-gram High | 5-gram Low |
|---|---|---|---|---|---|---|
| %, In | 16 | 40 | 4 | 12 | 0.3 | 0.8 |
| %, Out | 48 | 48 | 43 | 34 | 41 | 29 |

PPL (in thousands) per n-gram group, across both quartiles: 2-gram 18, 3-gram 17, 5-gram **21**.
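The memorisation check amounts to asking how often training n-grams of a given frequency band resurface in the conditioning inputs versus the generated outputs. A minimal sketch, where the n-gram sampling and quartile selection are simplified assumptions:

```python
# Sketch of the memorisation assessment: what fraction of target
# training n-grams reappear in a set of texts (inputs or outputs)?
def ngrams(tokens: list[str], n: int) -> list[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def leakage_rate(target_ngrams: list[tuple], texts: list[str], n: int) -> float:
    """Percentage of target n-grams that appear anywhere in the texts."""
    seen = set()
    for text in texts:
        seen.update(ngrams(text.split(), n))
    hits = sum(1 for g in target_ngrams if g in seen)
    return 100.0 * hits / len(target_ngrams)
```

Comparing the rate over input key phrases (%, In) with the rate over generated output (%, Out) shows how much the model reproduces beyond what it was explicitly given.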
Annotation categories for the human evaluation of the meaning of the generated text.
| # | Category | Group |
|---|---|---|
| 1 | Fully preserved | SAME |
| 2 | Preserved, details omitted | GOOD |
| 3 | Modified, does not contradict the diagnosis | GOOD |
| 4 | Modified, contradicts the diagnosis | BAD/IRR |
| 5 | Modified, irrelevant | BAD/IRR |
| 6 | No clinical sense | NO SENSE |
| 7 | Incomprehensible | NO SENSE |
Fig. 4 Matrix of inter-rater annotation agreement for 1K top+meta sentences.
For each document, we defined A1 as the first annotator and A2 as the second annotator. Each cell in the matrix represents the number of sentences marked by an annotator with a certain category (as defined in Table 4).
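Beyond the raw agreement matrix, a common summary statistic for two-annotator categorical data of this kind is Cohen's kappa; the sketch below uses invented annotations in the seven categories of Table 4, and kappa itself is an illustrative assumption rather than a statistic the paper reports:

```python
# Chance-corrected agreement between two annotators (categories 1-7).
from sklearn.metrics import cohen_kappa_score

# Hypothetical category assignments for eight sentences.
annotator_1 = [1, 2, 3, 6, 7, 1, 4, 5]
annotator_2 = [1, 2, 3, 6, 6, 2, 4, 5]

kappa = cohen_kappa_score(annotator_1, annotator_2)
```

Kappa corrects the raw cell counts of the agreement matrix for agreement expected by chance, which matters when some categories (e.g. "Fully preserved") dominate.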
Examples of artificial sentences and respective real sentences (all paraphrased) for test-class-mhr.
| Category | Type | Sentence (paraphrased) |
|---|---|---|
| Fully preserved | real | There was no clear evidence that he was responding to unseen stimuli. |
| | art. | No clear evidence of responding to unseen stimuli. |
| Preserved, details omitted | real | He did not have a clear understanding of why he was there or what was the problem with him. |
| | art. | He has no clear understanding why he is there. |
| Modified, does not contradict the diagnosis | real | That afternoon police were called after she assaulted her mother. |
| | art. | This afternoon police were called by her mother. |
| Modified, contradicts the diagnosis | real | He was not experiencing low mood or anhedonia and therefore does not meet the criteria for depressive disorder. |
| | art. | Today he continues to experience low mood and anxiety. |
| Modified, irrelevant | real | Her partner wants him to stay with him. |
| | art. | Her partner wants him to get out of bed. |
| No clinical sense | real | She acknowledged that paracetamol overdose could damage her liver. |
| | art. | Paracetamol overdose could damage her shoulder. |
| Incomprehensible | real | This relapse of heavy drinking may have been caused by the disruption of her accommodation relocation. |
| | art. | It was felt that heavy drinking may not be a accommodation relocation. |
Examples of disagreements on artificial sentences and respective real sentences (all paraphrased) for test-class-mhr.
| Disagreement | Type | Sentence (paraphrased) |
|---|---|---|
| GOOD vs. NO SENSE | real | When he was approached by the police, he started removing his trousers and becoming quite aggressive. |
| | art. | He had started removing the hair of the window, becoming quite aggressive. |
| GOOD vs. BAD/IRR | real | She appeared as though she felt under threat but the ward was very chaotic at that time with loud bangs. |
| | art. | Chaotic and loud bangs in her interactions. |
Text classification results (F1-scores) for test-class-mhr (fivefold CV; results averaged per class). We use the two-sample Kolmogorov–Smirnov (2S-KS) test for (a) comparisons between models trained with the same type of data, where * marks statistically significant improvements for LDA over BoW and for CNN over LDA (α = 0.05, n1 = n2 = 30); and (b) comparisons within a model trained with the different types of data (column KS test). Models using fewer than all key phrases that provided results closest to those with real data are highlighted in bold. We also report results of our ablation experiments where the training data contain only the context of key phrases, real or generated.
| Model | F20 | F32 | F60 | F31 | F25 | F10 | Avg | KS test (D, p) |
|---|---|---|---|---|---|---|---|---|
| BoW | | | | | | | | |
| genuine | 0.47 | 0.31 | 0.32 | 0.20 | 0.14 | 0.24 | | |
| | 0.47 | 0.33 | 0.27 | 0.23 | 0.17 | 0.23 | | 0.07, 0.88 |
| | 0.48 | 0.36 | 0.29 | 0.20 | 0.14 | 0.26 | | 0.09, 0.61 |
| | 0.47 | 0.27 | 0.26 | 0.11 | 0.12 | 0.23 | | 0.17, 0.02 |
| LDA | | | | | | | | |
| genuine* | 0.55 | 0.47 | 0.35 | 0.32 | 0.25 | 0.40 | | |
| | 0.55 | 0.44 | 0.35 | 0.31 | 0.26 | 0.37 | | 0.11, 0.35 |
| | 0.50 | 0.45 | 0.36 | 0.28 | 0.23 | 0.39 | | 0.14, 0.10 |
| | 0.54 | 0.45 | 0.38 | 0.30 | 0.24 | 0.40 | | 0.07, 0.88 |
| CNN | | | | | | | | |
| genuine* | 0.66 | 0.59 | 0.51 | 0.37 | 0.23 | 0.53 | | |
| | 0.65 | 0.57 | 0.47 | 0.27 | 0.24 | 0.50 | | 0.14, 0.10 |
| | 0.59 | 0.52 | 0.42 | 0.25 | 0.15 | 0.43 | | 0.22, 1e−3 |
| | 0.57 | 0.34 | 0.33 | 0.23 | 0.20 | 0.35 | | 0.37, 1.9e−09 |
| No key phrases | | | | | | | | |
| CNN | | | | | | | | |
| genuine | 0.48 | 0.34 | 0.22 | 0.22 | 0.15 | 0.12 | 0.25 | |
| | 0.30 | 0.30 | 0.09 | 0.25 | 0.09 | 0.03 | 0.18 | 0.24, 2.7e−04 |
| LDA | | | | | | | | |
| genuine* | 0.41 | 0.40 | 0.32 | 0.22 | 0.20 | 0.26 | 0.30 | |
| | 0.29 | 0.37 | 0.28 | 0.23 | 0.14 | 0.25 | 0.26 | 0.23, 4.4e−04 |
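The column KS test compares the distributions of F1 scores obtained by a model trained on genuine data against one trained on artificial data. A sketch using SciPy's two-sample Kolmogorov–Smirnov test; the score lists below are hypothetical illustrations, not the paper's underlying per-fold samples:

```python
# Two-sample KS test between F1-score distributions for a model
# trained on genuine vs. artificial data.
from scipy.stats import ks_2samp

f1_scores_genuine = [0.47, 0.31, 0.32, 0.20, 0.14, 0.24]
f1_scores_artificial = [0.47, 0.33, 0.27, 0.23, 0.17, 0.23]

stat, p_value = ks_2samp(f1_scores_genuine, f1_scores_artificial)
significant = p_value < 0.05  # alpha used in the paper
```

A large D statistic with a small p-value (as in the one+meta CNN row) indicates the artificial-data model's score distribution differs significantly from the genuine-data model's.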
Text classification results (averaged F1-scores) for test-class-mimic. We use the two-sample Kolmogorov–Smirnov (2S-KS) test for (a) comparisons between different models trained with the same type of data, where * marks statistically significant improvements for LDA over BoW and for CNN over LDA (α = 0.05, n1 = n2 = 65); and (b) comparisons within a model trained with the different types of data (column KS test). Models using fewer than all key phrases that provided results closest to those with real data are highlighted in bold.
| Model | Avg. F1 | KS test (D, p) |
|---|---|---|
| LDA | | |
| genuine | 0.23 | |
| | 0.21 | 0.22, 0.08 |
| | 0.21 | 0.23, 0.05 |
| | 0.13 | 0.54, 5.15e−09 |
| BoW | | |
| genuine* | 0.34 | |
| | 0.32 | 0.14, 0.53 |
| | 0.27 | 0.29, 0.01 |
| | 0.30 | 0.19, 0.20 |
| CNN | | |
| genuine* | 0.46 | |
| | 0.45 | 0.12, 0.68 |
| | 0.36 | 0.35, 4e−4 |
| | 0.24 | 0.59, 1.5e−10 |