Duy Duc An Bui (1), Qing Zeng-Treitler (1). (1) Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA; VA Salt Lake City Health Care System, Salt Lake City, Utah, USA.
Abstract
OBJECTIVES: Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification.
METHODS: We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and the RED+SVM classifier combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status, and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure. An SVM classifier alone was trained as the control.
RESULTS: The two RED classifiers achieved 80.9-83.0% overall accuracy on the two datasets, which is 1.3-3% higher than the accuracy of SVM alone (p<0.001). Similarly, small but consistent improvements were observed in precision, recall, and F-measure when the RED classifiers were compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that the SVM classifier misclassified (8.1-10.3% of the total instances and 43.8-53.0% of SVM's misclassifications).
CONCLUSIONS: Machine-generated regular expressions can be used effectively in clinical text classification. The regular expression-based classifier can be combined with other classifiers, such as SVM, to improve classification performance.
Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
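The abstract does not describe how RED derives its patterns, so the following is only a minimal sketch of the general idea of classifying text with per-class regular expressions. The patterns, the class labels, and the function names `regex_features` and `classify_by_votes` are all hypothetical and invented for illustration; simple vote counting stands in for the alignment step, which is not reproduced here. The 0/1 feature vector shows one plausible way regex matches could be handed to a downstream classifier such as an SVM.

```python
import re

# Hypothetical, hand-written patterns for illustration only.
# They are NOT output of the RED algorithm described in the abstract.
PATTERNS = {
    "smoker": [
        re.compile(r"\bsmokes?\b.*\bpacks?\b", re.I),
        re.compile(r"\bcurrent(ly)?\s+smok", re.I),
    ],
    "non_smoker": [
        re.compile(r"\b(denies|never)\s+smok", re.I),
        re.compile(r"\bnon-?smoker\b", re.I),
    ],
}


def regex_features(text):
    """Return one 0/1 feature per pattern, ordered by sorted class label.

    A vector like this could be passed to any feature-based classifier.
    """
    return [
        int(bool(pattern.search(text)))
        for label in sorted(PATTERNS)
        for pattern in PATTERNS[label]
    ]


def classify_by_votes(text, default="unknown"):
    """Pick the class whose patterns match most often (a stand-in for ALIGN)."""
    votes = {
        label: sum(bool(pattern.search(text)) for pattern in patterns)
        for label, patterns in PATTERNS.items()
    }
    best = max(votes, key=votes.get)
    # If no pattern matched at all, fall back to the default label.
    return best if votes[best] > 0 else default


if __name__ == "__main__":
    for snippet in ["Patient denies smoking.", "Currently smokes 2 packs per day."]:
        print(snippet, regex_features(snippet), classify_by_votes(snippet))
```

In this sketch a snippet that matches no pattern falls back to a default label; a real system would need a principled way to handle such cases, for example by deferring to a second classifier.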
Keywords:
Machine Learning; Natural Language Processing; Regular Expressions; Support Vector Machines; Text Classification