| Literature DB >> 35190834 |
Kevin Xie, Ryan S Gallagher, Erin C Conrad, Chadric O Garrick, Steven N Baldassano, John M Bernabei, Peter D Galer, Nina J Ghosn, Adam S Greenblatt, Tara Jennings, Alana Kornspun, Catherine V Kulick-Soper, Jal M Panchal, Akash R Pattnaik, Brittany H Scheid, Danmeng Wei, Micah Weitzman, Ramya Muthukrishnan, Joongwon Kim, Brian Litt, Colin A Ellis, Dan Roth.
Abstract
OBJECTIVE: Seizure frequency and seizure freedom are among the most important outcome measures for patients with epilepsy. In this study, we aimed to automatically extract this clinical information from unstructured text in clinical notes. If successful, this could improve clinical decision-making in epilepsy patients and allow for rapid, large-scale retrospective research.
Entities:
Keywords: electronic medical record; epilepsy; natural language processing; question-answering
Mesh:
Year: 2022 PMID: 35190834 PMCID: PMC9006692 DOI: 10.1093/jamia/ocac018
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Questions and corresponding task paradigm for the question-answering task
| Question | Task paradigm |
|---|---|
| Q1: “Has the patient had recent seizures?” | Classification |
| Q2: “How often does the patient have seizures?” | Text extraction |
| Q3: “When was the patient’s most recent seizure?” | Text extraction |
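
The two paradigms correspond to standard NLP setups: yes/no classification over a (question, passage) pair (BoolQ-style) and extractive span prediction (SQuAD-style). A minimal sketch, using generic public checkpoints rather than the paper's fine-tuned models, of how each question type could be posed with Hugging Face pipelines:

```python
# Minimal sketch (not the paper's exact models or prompts) of the two task
# paradigms using standard Hugging Face components. The extractive QA model
# is a generic public SQuAD2.0 checkpoint; the yes/no classification head is
# untrained here and would first need BoolQ-style fine-tuning.
from transformers import pipeline

note = ("Patient reports two breakthrough seizures since the last visit, "
        "the most recent occurring three weeks ago.")

# Q2/Q3 (text extraction): SQuAD-style QA returns a span from the note,
# e.g. "three weeks ago" for the date-of-last-seizure question.
extractor = pipeline("question-answering", model="deepset/roberta-base-squad2")
print(extractor(question="How often does the patient have seizures?", context=note))
print(extractor(question="When was the patient's most recent seizure?", context=note))

# Q1 (classification): a yes/no decision over a (question, note) pair; a
# generic encoder with a fresh 2-label head stands in for the fine-tuned
# classifier described in the paper.
classifier = pipeline("text-classification", model="bert-base-uncased")
print(classifier({"text": "Has the patient had recent seizures?", "text_pair": note}))
```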
Figure 1. Schematic of the annotation and fine-tuning pipeline. Pretrained language models are fine-tuned using masked language modeling by exposing them to 78,000 unannotated clinical notes; task-specific datasets (SQuADv2 and BoolQ3L); and the training set of annotated notes. This process was repeated for 5 different seeds.
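
The first step of that pipeline, masked language modeling on unannotated notes, is standard domain adaptation. A minimal sketch with the transformers Trainer, assuming the notes are plain-text strings; the checkpoint name is the public Bio_ClinicalBERT, but the hyperparameters and placeholder notes are illustrative:

```python
# Sketch of the masked-language-modeling (domain adaptation) step. The
# placeholder strings stand in for the corpus of unannotated clinical notes;
# hyperparameters are illustrative, not the paper's.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

notes = Dataset.from_dict({"text": ["placeholder clinical note one",
                                    "placeholder clinical note two"]})
tokenized = notes.map(lambda x: tokenizer(x["text"], truncation=True),
                      batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="mlm_notes", num_train_epochs=1),
                  train_dataset=tokenized,
                  data_collator=collator)
trainer.train()
```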
Figure 2. Mean intragroup Cohen’s κ and F1 scores with 95% confidence intervals. Annotators within each group were compared pairwise and demonstrated high agreement across all groups. κ was calculated using annotator classification values (Q1). F1 was calculated using annotator text extractions (Q2 and Q3).
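
Both agreement metrics are standard: Cohen’s κ for the categorical answers to Q1, and token-overlap F1 (the usual SQuAD definition, simplified here by skipping punctuation and article normalization) for the extracted spans of Q2 and Q3. A sketch, assuming annotator outputs are aligned lists:

```python
# Sketch of the two pairwise agreement metrics. The token-level F1 follows
# the standard SQuAD definition, without punctuation/article normalization.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two spans."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Q1: pairwise Cohen's kappa on two annotators' classification labels.
kappa = cohen_kappa_score(["yes", "no", "yes"], ["yes", "no", "no"])

# Q2/Q3: mean pairwise token F1 on extracted spans.
pairs = [("two seizures per month", "two seizures a month")]
mean_f1 = sum(token_f1(p, g) for p, g in pairs) / len(pairs)
```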
Figure 3. A comparison between the accuracy and F1 score of human and machine predictions. Machines achieved near-human performance on the classification question, and human-like performance on the text extraction questions. Gray points denote individual annotator performance, whereas green, orange, and purple points denote individual Bio_ClinicalBERT_FT, RoBERTa_FT, and BERT_FT seeds, respectively. Box plots show median and quartile ranges of values.
Figure 4. Contributions of the individual steps in the fine-tuning pipeline toward model performance. Both the masked language modeling step and the annotated training notes influenced final model performance. Bio_ClinicalBERT and RoBERTa were fine-tuned for classification and text extraction, respectively, with the same seeds. In each experiment, a specified step was removed from the fine-tuning pipeline. Box plots show median and quartile ranges of values.
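
A sketch of the ablation protocol this figure describes, assuming the pipeline can be expressed as a list of named steps; the step names and the `run_pipeline` callable are hypothetical stand-ins for the actual training and evaluation code:

```python
# Sketch of the step-removal (ablation) experiments: drop one pipeline step
# at a time and re-run fine-tuning with the same seeds. `run_pipeline` is a
# hypothetical (steps, seed) -> score callable; step names are illustrative.
from typing import Callable, Dict, List, Sequence

STEPS = ["mlm_on_notes", "task_dataset_finetuning", "annotated_note_finetuning"]

def ablation(run_pipeline: Callable[[List[str], int], float],
             seeds: Sequence[int] = (0, 1, 2, 3, 4)) -> Dict[str, List[float]]:
    """Score the pipeline with each step removed, across the same seeds."""
    scores = {}
    for removed in STEPS:
        kept = [s for s in STEPS if s != removed]
        scores[removed] = [run_pipeline(kept, seed) for seed in seeds]
    return scores
```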
Figure 5. Influence of training set size on mean model performance with 95% confidence intervals. Most of the models’ improvements occurred within the first 10% of the training data. Bio_ClinicalBERT and RoBERTa were fine-tuned for classification and text extraction, respectively, with the same seeds. Various annotated training set sizes were used for the final fine-tuning step.
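
A sketch of the corresponding learning-curve experiment, re-running the final fine-tuning step on fractions of the annotated training set; `finetune_and_evaluate` is a hypothetical stand-in for the task-specific training and evaluation routine:

```python
# Sketch of the training-set-size experiment: fine-tune on random subsets of
# the annotated notes at several fractions, averaging scores across seeds.
# `finetune_and_evaluate` is a hypothetical (subset, seed) -> score callable.
import random
from typing import Callable, Dict, List, Sequence

def learning_curve(train_set: List[dict],
                   finetune_and_evaluate: Callable[[List[dict], int], float],
                   seeds: Sequence[int] = (0, 1, 2, 3, 4),
                   fractions: Sequence[float] = (0.1, 0.25, 0.5, 1.0)
                   ) -> Dict[float, float]:
    """Mean score across seeds for each training-set fraction."""
    results = {}
    for frac in fractions:
        scores = []
        for seed in seeds:
            rng = random.Random(seed)
            subset = rng.sample(train_set, max(1, int(frac * len(train_set))))
            scores.append(finetune_and_evaluate(subset, seed))
        results[frac] = sum(scores) / len(scores)
    return results
```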
Reasons for erroneous predictions
| Reason for consistent errors | Total errors | Q1: classifying seizure freedom | Q2: extracting seizure frequency | Q3: extracting date of last seizure |
|---|---|---|---|---|
| Too much historical or irrelevant information | 44 | 18 | 8 | 18 |
| Many relevant events with different information | 8 | 8 | 0 | 0 |
| Identified irrelevant phenomena or failed to identify relevant phenomena from symptoms | 10 | 1 | 7 | 2 |
| Unusual time anchor | 3 | 0 | 3 | 0 |
| Temporal reasoning | 11 | 9 | 0 | 2 |
| Semantic reasoning | 6 | 0 | 1 | 5 |