Natural language processing for abstraction of cancer treatment toxicities: accuracy versus human experts.

Julian C Hong¹,²,³, Andrew T Fairchild³, Jarred P Tanksley³, Manisha Palta³, Jessica D Tenenbaum⁴.

Abstract

OBJECTIVES: Expert abstraction of acute toxicities is critical in oncology research but is labor-intensive and variable. We assessed the accuracy of a natural language processing (NLP) pipeline to extract symptoms from clinical notes compared to physicians.
MATERIALS AND METHODS: Two independent reviewers identified present and negated National Cancer Institute Common Terminology Criteria for Adverse Events (CTCAE) v5.0 symptoms in 100 randomly selected notes from on-treatment visits during radiation therapy, with adjudication by a third reviewer. An NLP pipeline based on the Apache clinical Text Analysis Knowledge Extraction System (cTAKES) was developed and used to extract CTCAE terms. Accuracy was assessed by precision, recall, and F1.
RESULTS: The NLP pipeline demonstrated high accuracy for common physician-abstracted symptoms, such as radiation dermatitis (F1 0.88), fatigue (0.85), and nausea (0.88). NLP had poor sensitivity for negated symptoms.
CONCLUSION: NLP accurately detects a subset of documented present CTCAE symptoms, though it is limited for negated symptoms. It may facilitate strategies to more consistently identify toxicities during cancer therapy.

Keywords: cancer; chemoradiation; natural language processing; radiation therapy; toxicity

Year: 2020 PMID: 33623888 PMCID: PMC7886534 DOI: 10.1093/jamiaopen/ooaa064

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

LAY SUMMARY

Expert abstraction of acute toxicities is critical in oncology research but can be labor-intensive and highly variable. We developed and assessed a natural language processing (NLP) pipeline to extract symptoms from clinical notes in comparison to physician reviewers. NLP accurately identified documented present Common Terminology Criteria for Adverse Events (CTCAE) symptoms but had limited detection of documented negated symptoms. Given the limitations of human review, NLP may facilitate research strategies to more consistently identify toxicities during cancer therapy.

INTRODUCTION

The abstraction of treatment- and disease-related symptomatology is critical in oncology research. Because prospective toxicity documentation in clinical trials underestimates adverse events, the most rigorous method integrates retrospective human review, which anchors both prospective and retrospective studies in oncology. However, manual review, whether by a clinician or clinical research assistant, is labor-intensive and prone to human variation. This critical clinical and analytical need presents an important potential use for natural language processing (NLP). NLP can leverage increasing computational power and electronic health records (EHRs) to automate the systematic extraction of data from free text. Clinical NLP has been an area of active interest given the expansive and important data locked exclusively in clinical free-text notes. A number of broad clinical NLP tools have been developed and are available to extract content from clinical notes, including Apache clinical Text Analysis Knowledge Extraction System (cTAKES), MetaMap, and Clinical Language Annotation, Modeling, and Processing (CLAMP) Toolkit. Continued advances in machine learning, such as deep learning, have subsequently facilitated specific use cases where underlying patterns in text can be associated directly with specific concepts, such as clinical outcomes. In oncology, NLP efforts have largely focused on the extraction of data and insights from semistructured text, such as radiology and pathology reports. There have been limited efforts evaluating the accuracy of the extraction of toxicity data. In particular, NLP tools are limited by their gold standard corpora and by annotations generated by only a few reviewers. Adaptation and validation for a specific use case can enable the use of NLP in clinical research.
Its implementation offers opportunities for more consistent extraction of clinical data and may also facilitate automated extraction of data to augment clinical prediction and decision support tools. Given the limitations and variability in human expert review, the objective of this study was to develop and evaluate an NLP pipeline against human expert reviewers for the extraction of National Cancer Institute (NCI) Common Terminology Criteria for Adverse Events (CTCAE) symptoms.

METHODS

This study was approved by the Duke University Medical Center Institutional Review Board (Pro00082776). We developed an NLP pipeline based on publicly available tools for extracting CTCAE v5.0 terms from oncology notes. As previously described, 100 randomly selected notes for weekly scheduled radiotherapy on-treatment visits (OTVs) at a single academic center between 2005 and 2016 were independently reviewed by two senior radiation oncology residents. Patients undergoing radiotherapy are seen by their physicians during weekly OTVs to manage symptoms related to treatment or disease. Documentation for these visits can be institution-specific but is typically brief and in SOAP format, with a subjective section describing patient symptoms, an objective section including focused physical exam findings, and an assessment and plan. OTV documentation is typically captured in a medical center-wide EHR (as is the case at our institution) or in a department-centric oncology information system. At our institution, notes are primarily free text, though standardized EHR templates prepopulate vital signs and physical exam headers, and physical exam findings can be selected from predefined options. Style and content varied across physicians and disease sites. As with other radiation oncology notes, specialty abbreviations can be included, such as “fx” for “fraction” (the delivery of one radiation treatment) and “Gy” for “Gray” (a unit of radiation dose). However, the language would be expected to be recognizable across oncologic specialties, particularly in describing symptoms. OTV notes also contain very little automatically populated text in comparison to consultation or follow-up notes. Notes reviewed in this study did not include explicit structured CTCAE toxicities. Reviewers were instructed to comprehensively label each CTCAE symptom as explicitly present, negated, or not mentioned and were blinded to each other’s labels.
This was performed using a checklist of all CTCAE terms, sorted by system, available from the NCI in multiple formats. A thesaurus (previously published and embedded in available code on GitHub) was created to harmonize overlapping CTCAE terms identified by the reviewers (e.g. cough and productive cough). Labels were then reviewed by an attending radiation oncologist to create a consensus. The plain text notes were processed through the open-source Apache cTAKES v4.0.0 default clinical pipeline. cTAKES consists of multiple components to process clinical free text, including a sentence boundary detector, tokenizer, normalizer, part-of-speech tagger, shallow parser, and a named entity recognition annotator with negation. The default clinical pipeline is an easily accessible deployment which includes annotations for the most commonly desired outputs. Among the annotations provided are anatomical sites, signs/symptoms, procedures, disease/disorders, and medications. These were initially coded in SNOMED CT terminology and then mapped to Medical Dictionary for Regulatory Activities (MedDRA) terms using the Observational Health Data Sciences and Informatics Athena vocabulary. Since v4.0, CTCAE has been integrated into MedDRA, with mapping available from the NCI. Our code for processing the cTAKES-extracted terms is available on GitHub. Because cTAKES identified additional MedDRA terms, we generated and made publicly available a separate thesaurus to map alternative terms to corresponding CTCAE elements (Supplementary Data and available on GitHub). NLP output was compared against human consensus. For both human and NLP abstraction, symptoms with multiple appearances in a note were designated as present if there was at least one positive mention. Standard evaluation statistics were generated, including precision (positive predictive value), recall (sensitivity), and F1 (harmonic mean of precision and recall) for individual symptoms.
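The harmonization and note-level aggregation steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' published code: the `THESAURUS` dictionary and the function names `normalize` and `aggregate_note` are hypothetical, and only the cough/productive cough pairing is taken from the text.

```python
from collections import defaultdict

# Hypothetical thesaurus: maps overlapping terms to one harmonized label.
# Only the "productive cough" -> "cough" pairing comes from the text.
THESAURUS = {
    "productive cough": "cough",
}


def normalize(term: str) -> str:
    """Lower-case a term and map it to its harmonized label if one is listed."""
    key = term.strip().lower()
    return THESAURUS.get(key, key)


def aggregate_note(mentions):
    """Collapse (term, is_negated) mentions from one note into one label per symptom.

    A symptom is 'present' if at least one mention is positive; it is
    'negated' only if every mention of it in the note is negated.
    """
    seen = defaultdict(list)
    for term, is_negated in mentions:
        seen[normalize(term)].append(is_negated)
    return {
        symptom: "negated" if all(flags) else "present"
        for symptom, flags in seen.items()
    }


# Illustrative mentions, not taken from any study note.
mentions = [("Productive cough", False), ("cough", True), ("nausea", True)]
labels = aggregate_note(mentions)
```

In this sketch, one positive mention of cough outweighs the negated one, so cough is labeled present for the note while nausea stays negated.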
The unweighted Cohen’s kappa coefficient between NLP and each of the reviewers was also assessed to provide a broad assessment.
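For readers who want to reproduce these statistics on their own labels, the sketch below computes per-symptom precision, recall, and F1, plus an unweighted Cohen's kappa, in plain Python. The function names and example labels are illustrative assumptions; the study's own analysis code may differ.

```python
def precision_recall_f1(gold, pred):
    """Per-symptom metrics from equal-length lists of booleans (one entry per note).

    Returns None for a metric that is undefined, for example precision when
    the pipeline never predicts the symptom (reported as NA in a table).
    """
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    labels = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((list(a).count(l) / n) * (list(b).count(l) / n) for l in labels)
    if expected == 1:
        return None  # undefined when both raters use one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels for one symptom across five notes (not study data).
gold = [True, True, True, False, False]
pred = [True, True, False, True, False]
p, r, f1 = precision_recall_f1(gold, pred)
kappa = cohen_kappa(gold, pred)
```

A symptom that the pipeline never flags yields an undefined precision and F1 with a recall of zero, which is how a row with NA, 0.00, NA can arise.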

RESULTS

As previously described, 100 notes written by 15 physicians were evaluated, representing diverse disease sites (Table 1). No two notes were from the same patient or treatment course. Among the most commonly present terms on human review, such as radiation dermatitis, fatigue, nausea, pruritus, and noninfectious cystitis, NLP demonstrated overall good precision, recall, and F1 (Table 1). Of note, the NLP pipeline did not detect urinary urgency (MedDRA code 10046593). It was, however, very sensitive (recall 1.00) for noninfectious cystitis (F1 0.75). NLP demonstrated good performance in identifying some symptoms that had previously demonstrated low human inter-rater reliability, including radiation dermatitis, fatigue, and noninfectious cystitis. For folliculitis, which also had low inter-rater reliability, precision was high (1.00) but recall was low (0.14). Precision was more limited for documentation of pain (0.36; F1 0.45). Example NLP errors for pain and diarrhea, two of the more common symptoms, are presented in Table 2.

Table 1.

Note characteristics and extracted symptoms

Word count	Median 203	IQR 164.5–237.5
Character count	Median 1324.5	IQR 1103.25–1592.5
Number of note authors	15
Disease site	Number (N = 100)
Breast	32
Head and neck	15
Prostate	13
Central nervous system	10
Lung	8
Gynecologic	7
Bladder	4
Metastases (spine, spine, adrenal, leg/lung)	4
Sarcoma	3
Esophagus	1
Skin	1
Pelvic lymphoma	1
Multiple myeloma	1
Most common present symptoms	Number present (N = 100)	Precision (PPV)	Recall (sensitivity)	F1	Reviewer Kappa
Dermatitis-radiation	35	0.97	0.80	0.88	0.57
Fatigue	34	1.00	0.74	0.85	0.51
Pain	24	0.36	0.63	0.45	0.65
Nausea	13	0.92	0.85	0.88	0.86
Pruritus	11	0.91	0.91	0.91	0.67
Cystitis, noninfectious	9	0.60	1.00	0.75	0.00
Diarrhea	8	0.28	0.63	0.38	0.92
Mucositis	8	0.83	0.63	0.71	0.62
Urinary urgency	8	NA	0.00	NA	0.83
Folliculitis	7	1.00	0.14	0.25	0.00
Hot flashes	7	0.54	1.00	0.70	0.92
Total	277
Most common negated symptoms	Number negated (N = 100)	Precision (PPV)	Recall (sensitivity)	F1	Reviewer Kappa
Dermatitis-radiation	42	0.89	0.19	0.31	0.57
Pain	27	0.50	0.07	0.13	0.65
Superficial soft tissue fibrosis	19	NA	0.00	NA	0.00
Diarrhea	18	1.00	0.11	0.20	0.92
Seroma	18	NA	0.00	NA	0.93
Thrush	16	1.00	0.31	0.48	0.11
Hematuria	16	NA	0.00	NA	0.88
Hematochezia	16	1.00	0.06	0.12	0.93
Dysuria	15	NA	0.00	NA	0.81
Pruritus	13	1.00	0.85	0.92	0.67
Urinary incontinence	13	NA	0.00	NA	0.96
Total	358

IQR: interquartile range; PPV: positive predictive value.

Number present or negated based on consensus adjudication of identifications by both reviewers, rather than the total number of times symptoms were identified by either reviewer.
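As a quick consistency check on Table 1, F1 can be recomputed from the reported precision and recall using the harmonic mean, F1 = 2PR / (P + R). The sketch below does this for four rows of the present-symptom section. Because the table values are rounded to two decimals, the recomputed F1 can differ from the reported one by about 0.01; the helper name `f1_from` is an assumption for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the present-symptom rows of Table 1.
rows = {
    "Dermatitis-radiation": (0.97, 0.80, 0.88),
    "Fatigue": (1.00, 0.74, 0.85),
    "Pain": (0.36, 0.63, 0.45),
    "Diarrhea": (0.28, 0.63, 0.38),
}

for name, (p, r, reported) in rows.items():
    recomputed = f1_from(p, r)
    print(f"{name}: recomputed F1 = {recomputed:.3f}, reported = {reported:.2f}")
```

Rows with NA precision, such as urinary urgency, are excluded because F1 is undefined when the pipeline never predicts the symptom.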

Table 2.

Examples of challenging note phrases for common symptoms

Note phrase
“significant pain on the right side of his face”
“instructed on soft foods and pain control for maintaining PO intake”
“she is not having any residual pain”
“she had one episode of diarrhea today”
“she has been having 5–6 loose bowel movements daily, taking 3 Imodium/day”
“diarrhea none”
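The phrases in Table 2 hint at why rule-based negation detection is hard. The toy sketch below is an assumption-laden illustration of a trigger-window rule in the spirit of NegEx-style approaches. It is not the cTAKES negation annotator, and the trigger lists and the `is_negated` function are hypothetical.

```python
import re

# Hypothetical trigger lists for illustration only; not the cTAKES lexicon.
PRE_NEGATION = ("no", "not", "denies", "without")   # triggers that precede the symptom
POST_NEGATION = ("none", "absent", "resolved")      # triggers that follow the symptom


def is_negated(phrase: str, symptom: str, window: int = 5) -> bool:
    """Return True if a negation trigger lies within `window` tokens of the symptom."""
    tokens = re.findall(r"[a-z0-9/]+", phrase.lower())
    if symptom not in tokens:
        raise ValueError(f"{symptom!r} not found in phrase")
    i = tokens.index(symptom)
    before = tokens[max(0, i - window):i]
    after = tokens[i + 1:i + 1 + window]
    return any(t in PRE_NEGATION for t in before) or any(t in POST_NEGATION for t in after)


examples = [
    ("she is not having any residual pain", "pain"),
    ("diarrhea none", "diarrhea"),
    ("she had one episode of diarrhea today", "diarrhea"),
    ("instructed on soft foods and pain control for maintaining PO intake", "pain"),
]
results = [is_negated(phrase, symptom) for phrase, symptom in examples]
```

A rule that only looks for triggers before the symptom would miss the telegraphic “diarrhea none,” and no trigger list of this kind recognizes the anticipatory guidance in the last example as something other than a present symptom.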

NLP was more limited in detecting negated symptoms, the most common of which were radiation dermatitis, pain, and soft tissue fibrosis (Table 1). In general, for negated symptoms, NLP demonstrated low recall, though with high precision. NLP did demonstrate strong detection of negated pruritus, which was noted as a negated symptom in 13 notes. For comparison with the inter-rater variability of expert abstraction, the unweighted Cohen’s kappa coefficients between NLP and each reviewer were 0.52 (95% confidence interval 0.49–0.56) and 0.49 (0.52–0.55). This was lower than the unweighted kappa between the two reviewers (0.68, 0.65–0.71).

DISCUSSION

NLP offers a potential method for detecting specific documented present CTCAE v5.0 symptoms in comparison to human review. Given the effort and inter-rater variability intrinsic to expert review, it may be a good option for systematically assessing toxicities from unstructured clinical data. It may also support or validate toxicities identified during clinical trials or retrospective analysis. Notably, there was greater variability between NLP and each individual reviewer than between the two reviewers. In particular, NLP was limited in its ability to identify expert-identified negated symptoms, a more semantically complex task. However, most research use cases prioritize the identification of present symptoms. NLP performed worse for certain symptoms, notably pain (precision 0.36 and recall 0.63). This may be attributable to the multitude of pain-related terms, which may reduce recognition accuracy. There were a number of false positives. These included more complex concepts such as anticipatory guidance for future symptoms (“instructed on soft foods and pain control for maintaining PO intake”) as well as missed negations (“she is not having any residual pain”). Several missed identifications explicitly included the word “pain,” with some showing more separation between the term identifying pain and the site, for example, “significant pain on the right side of his face.” NLP did not map this example to either general or site-specific pain concepts. Diarrhea was another term that challenged NLP (precision 0.28 and recall 0.63).
We identified simple misses (“she had one episode of diarrhea today”) as well as more ambiguous phrases (“she has been having 5–6 loose bowel movements daily, taking 3 Imodium/day”). False positives were primarily missed negations, including those in incomplete sentences like “Diarrhea none.” In our study, NLP was compared against annotation by multiple senior radiation oncology residents, whom we expect to have accuracy comparable to attending physicians given their responsibility for the majority of clinical documentation as well as their active academic engagement in evaluating studies that incorporate CTCAE. This specialty multiexpert review is also important in evaluating this specific use case, as NLP efforts, clinical trials, and retrospective studies alike frequently rely on individual clinical research assistants, medical students, or nonsubject matter experts. NLP has had an increasing number of applications in oncology. There have been limited data validating how accurately NLP extracts named entities in comparison to human review. A number of efforts have occurred in the semistructured space, working to extract data from pathology reports, including staging and histology, and from radiology reports, including Breast Imaging Reporting and Data System (BI-RADS) assessments. Plain-text approaches have also been used to identify patients with advanced and metastatic cancer. Importantly, the Cancer Deep Phenotype Extraction (DeepPhe) system is a cancer-centric NLP system built on cTAKES for the abstraction of comprehensive cancer information. Separate from semantically extracting information from notes, other recent studies have focused on the use of aggregate data for outcome prediction. Within symptom identification, work has built on the larger field of adverse drug event monitoring.
For cancer, this has been fairly limited; a prior work demonstrated the use of identifying topics within patient communications via the patient portal, demonstrating that side effect terms were associated with early discontinuation of hormonal therapy. This demonstrates an additional potential application of accurate symptom identification, as patients frequently will supplement PRO questionnaires with free text data. This study is limited by its small sample of notes and expert reviewers at a single institution. Additionally, OTV notes are specifically intended to report toxicities and may overrepresent this information. The notes in our sample are also brief with very limited autopopulated text in comparison to documentation for other encounters. Thus, it is possible that the reported performance may not generalize across all oncology documentation. However, the gold standard data set upon which cTAKES was initially built was based on four total human annotators and 1556 annotations on 160 clinical notes, and our study does serve as an external assessment for this specific use case. These limitations also underscore the labor intensity of expert review and emphasize the need for high-quality computational tools. Additionally, it is likely that our current pipeline, built from currently freely available “out-of-the-box” tools developed at a separate institution, would demonstrate additional accuracy with additional modifications. Development of a separate model was considered, but it was ultimately decided to dedicate the statistical power of our manual annotation towards externally assessing the default configuration of a broadly available and used software package. Finally, this study focused on the extraction of symptoms in isolated notes across distinct patients. Attribution of toxicities would require temporal assessment across many notes, which would require more intensive abstraction by our expert reviewers. 
This study offers important use-case specific modifications and an independent external assessment of the tool. Importantly, we demonstrate that NLP offers good performance for the identification of specific symptoms in comparison to human expert reviewers. While altering physician workflows to consistently and prospectively document acute toxicities may be an option (adopted at some institutions), prior data suggest that such documentation may underreport symptoms documented in the clinical chart. Furthermore, the tools we evaluated and generated in this study are freely available online. NLP did not have strong detection of expert-identified negated symptoms, which likely reflects the greater complexity of that task; expert-defined negations also identified specific scenarios that prompt disagreement during manual review. In addition to offering an alternative to human review for the ascertainment of toxicities, NLP can also be used to validate manually collected toxicities; this may augment the detection of toxicities in clinical trials and in retrospective research. Accurate extraction of clinical elements from free text may offer refined and rational features for predictive models, building on studies where aggregate text provided utility in predicting clinical outcomes. Our team recently completed one of the first prospective, randomized studies of machine learning, utilizing EHR data to generate accurate predictions of acute care and to direct supportive care. NLP offers an additional source of insights from routine clinical data that may augment the performance of such models for clinical decision support.

CONCLUSIONS

In conclusion, the use of an NLP pipeline facilitates CTCAE symptom identification via publicly available tools. In light of reviewer variability, this may serve as a tool to improve the consistency of toxicity capture in retrospective and prospective settings.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

FUNDING

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. Publication made possible in part by support from the UCSF Open Access Publishing Fund.

Conflict of interest statement

J.D.T., M.P., and J.C.H. are coinventors on a pending patent, “Systems and methods for predicting acute care visits during outpatient cancer therapy,” broadly related to this manuscript.

Data availability statement

The data underlying this article cannot be shared publicly due to the privacy of individuals whose data were used in the study.
