Richard E Leiter1, Enrico Santus2, Zhijing Jin2, Katherine C Lee3, Miryam Yusufov4, Isabel Chien2, Ashwin Ramaswamy5, Edward T Moseley6, Yujie Qian2, Deborah Schrag7, Charlotta Lindvall8. 1. Harvard Medical School, Boston, Massachusetts, USA; Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, Massachusetts, USA; Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA. Electronic address: richard_leiter@dfci.harvard.edu. 2. Massachusetts Institute of Technology, Boston, Massachusetts, USA. 3. Department of Surgery, University of California San Diego Health, San Diego, California, USA. 4. Harvard Medical School, Boston, Massachusetts, USA; Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 5. Department of Surgery, NewYork-Presbyterian Hospital/Weill Cornell Medical Center, New York, New York, USA. 6. Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 7. Harvard Medical School, Boston, Massachusetts, USA; Division of Population Sciences, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 8. Harvard Medical School, Boston, Massachusetts, USA; Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, Massachusetts, USA; Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA.
Abstract
CONTEXT: Clinicians lack reliable methods to predict which patients with congestive heart failure (CHF) will benefit from cardiac resynchronization therapy (CRT). Symptom burden may help to predict response, but this information is buried in free-text clinical notes. Natural language processing (NLP) may identify symptoms recorded in the electronic health record and thereby enable this information to inform clinical decisions about the appropriateness of CRT. OBJECTIVES: To develop, train, and test a deep NLP model that identifies documented symptoms in patients with CHF receiving CRT. METHODS: We identified a random sample of clinical notes from a cohort of patients with CHF who later received CRT. Investigators labeled documented symptoms as present, absent, or context dependent (pathologic depending on the clinical situation). The algorithm was trained on 80% of the notes, and parameters were fine-tuned on a further 10%; we tested the model on the remaining 10%. We compared the model's performance to investigators' annotations using accuracy, precision (positive predictive value), recall (sensitivity), and F1 score (the harmonic mean of precision and recall). RESULTS: Investigators annotated 154 notes (352,157 words) and identified 1340 present, 1300 absent, and 221 context-dependent symptoms. In the test set of 15 notes (35,467 words), the model's accuracy was 99.4% and recall was 66.8%. Precision was 77.6%, and overall F1 score was 71.8. F1 scores for present (70.8) and absent (74.7) symptoms were higher than the score for context-dependent symptoms (48.3). CONCLUSION: A deep NLP algorithm can be trained to capture documented symptoms in patients with CHF who received CRT, with promising precision and recall.
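As a consistency check on the reported metrics, the overall F1 score (the harmonic mean of precision and recall) can be recomputed from the abstract's stated precision (77.6%) and recall (66.8%). A minimal Python sketch; the function name is illustrative, not from the paper:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overall metrics reported in the abstract, expressed as fractions.
precision = 0.776  # positive predictive value
recall = 0.668     # sensitivity

# Recomputed F1, scaled to match the abstract's reporting convention.
f1 = round(f1_score(precision, recall) * 100, 1)
print(f1)  # → 71.8, matching the reported overall F1 score
```

The recomputed value agrees with the reported overall F1 of 71.8, confirming the three overall metrics are internally consistent.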