Sajjad Abedian1, Evan T Sholle1,2, Prakash M Adekkanattu1, Marika M Cusick1, Stephanie E Weiner1, Jonathan E Shoag2, Jim C Hu2, Thomas R Campion1,3,4,5. 1. Information Technologies and Services Department, Weill Cornell Medicine, New York, NY. 2. Department of Population Health Sciences, Weill Cornell Medicine, New York, NY. 3. Department of Urology, Weill Cornell Medicine, New York, NY. 4. Clinical and Translational Science Center, Weill Cornell Medicine, New York, NY. 5. Department of Pediatrics, Weill Cornell Medicine, New York, NY.
Abstract
PURPOSE: Typically stored as unstructured notes, surgical pathology reports contain data elements valuable to cancer research that require labor-intensive manual extraction. Although studies have described natural language processing (NLP) of surgical pathology reports to automate information extraction, efforts have focused on specific cancer subtypes rather than across multiple oncologic domains. To address this gap, we developed and evaluated an NLP method to extract tumor staging and diagnosis information across multiple cancer subtypes.
METHODS: The NLP pipeline was implemented on an open-source framework called Leo. We used a total of 555,681 surgical pathology reports of 329,076 patients to develop the pipeline and evaluated our approach on subsets of reports from patients with breast, prostate, colorectal, and randomly selected cancer subtypes.
RESULTS: Averaged across all four cancer subtypes, the NLP pipeline achieved an accuracy of 1.00 for International Classification of Diseases, Tenth Revision codes, 0.89 for T staging, 0.90 for N staging, and 0.97 for M staging. It achieved an F1 score of 1.00 for International Classification of Diseases, Tenth Revision codes, 0.88 for T staging, 0.90 for N staging, and 0.24 for M staging.
CONCLUSION: The NLP pipeline was developed to extract tumor staging and diagnosis information across multiple cancer subtypes to support the research enterprise in our institution. Although it was not possible to demonstrate generalizability of our NLP pipeline to other institutions, other institutions may find value in adopting a similar NLP approach, and reusing code available at GitHub, to support the oncology research enterprise with elements extracted from surgical pathology reports.
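The gap between the reported accuracy (0.97) and F1 score (0.24) for M staging can look surprising, and the abstract does not explain it. One common way such a gap arises, offered here only as an illustration and not as the authors' explanation, is label imbalance: when most reports carry no M stage, a pipeline can agree with the reference on nearly every report while still missing most of the rare positive cases. The sketch below uses invented labels (`"none"` and `"M1"`) and hypothetical helper functions; none of the data comes from the study.

```python
def accuracy(gold, pred):
    """Fraction of reports where the predicted label equals the reference label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1 score for a single label, computed from true/false positives and false negatives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 100 reports, only 3 of which state an M stage.
gold = ["none"] * 97 + ["M1"] * 3
# Hypothetical predictions: every "none" is right, but only 1 of the 3 "M1" is found.
pred = ["none"] * 97 + ["M1"] + ["none"] * 2

acc = accuracy(gold, pred)
f1_m1 = f1_for_label(gold, pred, "M1")
print(f"accuracy = {acc:.2f}, F1 for M1 = {f1_m1:.2f}")
```

In this toy case accuracy stays high because the frequent label dominates the count, while F1 for the rare label drops because two of the three positive cases are missed. How the study averaged F1 across labels and subtypes is not stated in the abstract, so the numbers here are not meant to reproduce the reported values.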