Yiqing Zhao (1), Saravut J Weroha (2), Ellen L Goode (3), Hongfang Liu (1), Chen Wang (4)
1. Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA.
2. Division of Medical Oncology, Department of Oncology, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA.
3. Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA.
4. Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA. wang.chen@mayo.edu.
Abstract
BACKGROUND: Next-generation sequencing provides comprehensive information about individuals' genetic makeup and is commonplace in oncology clinical practice. However, the utility of genetic information in clinical decision-making has not been examined extensively from a real-world, data-driven perspective. By mining real-world data (RWD) from clinical notes, patients' genetic information can be extracted and then associated with treatment decisions.
METHODS: We proposed a real-world evidence (RWE) study framework that incorporates context-based natural language processing (NLP) methods and data quality examination before the final association analysis. The framework was demonstrated in a Foundation-tested women's cancer cohort (N = 196). After retrieving patients' genetic information with the NLP system, we assessed the completeness of genetic data captured in unstructured clinical notes against a genetic data model. We examined the distribution of topics regarding BRCA1/2 throughout patients' treatment process, and then analyzed the association between BRCA1/2 mutation status and the discussion or prescription of targeted therapy.
RESULTS: We identified seven topics in the clinical context of genetic mentions: Information, Evaluation, Insurance, Order, Negative, Positive, and Variants of unknown significance. Our rule-based system achieved a precision of 0.87, a recall of 0.93, and an F-measure of 0.91. Our machine learning system achieved a precision of 0.901, a recall of 0.899, and an F-measure of 0.9 for four-topic classification, and a precision of 0.833, a recall of 0.823, and an F-measure of 0.82 for seven-topic classification. In result-containing sentences, the capture rate of BRCA1/2 mutation information was 75%, but detailed variant information (e.g., variant types) was largely missing. Using cleaned RWD, we found significant associations between BRCA1/2 positive mutation status and targeted therapies.
CONCLUSIONS: We demonstrated a framework to generate RWE using RWD from different clinical sources. The rule-based NLP system achieved the best performance in resolving contextual variability when extracting RWD from unstructured clinical notes. Data quality issues such as incompleteness and discrepancies exist, so manual data cleaning is needed before further analysis can be performed. Finally, we were able to use cleaned RWD to evaluate the real-world utility of genetic information in initiating a prescription of targeted therapy.
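To make the idea of context-based, rule-based topic assignment more concrete, the following is a minimal sketch of how a sentence mentioning BRCA1/2 might be mapped to one of the seven topics named in the results. The keyword patterns, their ordering, and the names `TOPIC_RULES`, `BRCA_MENTION`, and `classify_sentence` are illustrative assumptions; the abstract does not describe the actual rules used in the study.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
# These patterns are illustrative assumptions, not the study's rule set.
TOPIC_RULES = [
    ("Variants of unknown significance",
     r"\b(vus|variants? of (unknown|uncertain) significance)\b"),
    # Negative is checked before Positive so that phrases such as
    # "no pathogenic mutations" are not labeled Positive.
    ("Negative",
     r"\b(negative|no (deleterious |pathogenic )?mutations?|wild[- ]?type)\b"),
    ("Positive", r"\b(positive|deleterious|pathogenic|carrier)\b"),
    ("Insurance", r"\b(insurance|coverage|prior authorization)\b"),
    ("Order", r"\b(order(ed)?|sent for|will send)\b"),
    ("Evaluation", r"\b(refer(red|ral)?|genetic counsel(ing|or)|evaluat\w*)\b"),
]

# Matches BRCA, BRCA1, BRCA2, BRCA1/2, BRCA-1, "BRCA 2", and so on.
BRCA_MENTION = re.compile(r"\bBRCA[\s-]?[12]?\b", re.IGNORECASE)


def classify_sentence(sentence):
    """Return a topic label for a sentence, or None if BRCA is not mentioned."""
    if not BRCA_MENTION.search(sentence):
        return None
    lowered = sentence.lower()
    for topic, pattern in TOPIC_RULES:
        if re.search(pattern, lowered):
            return topic
    # Fallback for general discussion of the genes or the test.
    return "Information"


if __name__ == "__main__":
    examples = [
        "BRCA1/2 testing was negative.",
        "BRCA2 variant of uncertain significance was identified.",
        "Insurance denied coverage for BRCA testing.",
    ]
    for text in examples:
        print(classify_sentence(text), "|", text)
```

A first-match-wins rule list keeps the behavior easy to audit, which matters when clinical phrasing varies; a real system would likely also need negation scope and section context, which this sketch omits.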
Keywords:
BRCA1/2; Electronic health records; Natural language processing; PARP inhibitor; Precision medicine; Real-world evidence