Aaron Sonabend W (1), Winston Cai (2), Yuri Ahuja (1), Ashwin Ananthakrishnan (3), Zongqi Xia (4), Sheng Yu (5), Chuan Hong (6).
Affiliations: (1) Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA. (2) Bronx Science, New York City, NY, USA. (3) Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, USA. (4) Department of Neurology and Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA. (5) Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China. (6) Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. Electronic address: chuan_hong@hms.harvard.edu.
Abstract
OBJECTIVE: Accurate coding is critical for medical billing and electronic medical record (EMR)-based research. Recent research has focused on developing supervised methods to automatically assign International Classification of Diseases (ICD) codes from clinical notes. However, supervised approaches rely on ICD code data stored in the hospital EMR system and are subject to bias arising from practice and coding behavior. Consequently, the portability of trained supervised algorithms to external EMR systems may suffer.
METHOD: We developed an unsupervised knowledge integration (UNITE) algorithm that automatically assigns ICD codes for a specific disease by analyzing clinical narrative notes via semantic relevance assessment. The algorithm was validated using coded ICD data for 6 diseases from the Partners HealthCare (PHS) Biobank and the Medical Information Mart for Intensive Care (MIMIC-III). We compared the performance of UNITE against penalized logistic regression (LR), topic modeling, and neural network models within each EMR system. We additionally evaluated the portability of UNITE by training at PHS Biobank and validating at MIMIC-III, and vice versa.
RESULTS: UNITE achieved an average area under the receiver operating characteristic curve (AUC) of 0.91 at PHS and 0.92 at MIMIC over the 6 diseases, comparable to LR and multilayer perceptron (MLP) models. It performed substantially better than topic models. With regard to portability, the performance of UNITE was consistent across EMR systems and superior to that of LR, topic models, and neural network models.
CONCLUSION: UNITE accurately assigns ICD codes in EMRs without requiring human labor and has major advantages over commonly used machine learning approaches. In addition, UNITE attained stable performance and high portability across EMRs at different institutions.
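The headline metric in the results is an AUC averaged over diseases. As a minimal sketch of how such a figure could be computed, the Python below evaluates the AUC per disease as the probability that a randomly chosen coded patient receives a higher score than a randomly chosen uncoded one, then averages across diseases. The function names (`auc`, `averaged_auc`), the disease keys, and the toy scores are all hypothetical illustrations; they are not the authors' code or data.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def averaged_auc(per_disease):
    """Unweighted mean of per-disease AUCs; per_disease maps name -> (scores, labels)."""
    return mean(auc(scores, labels) for scores, labels in per_disease.values())


# Toy data (hypothetical): model scores and ICD-coded labels for two diseases.
toy = {
    "disease_a": ([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]),
    "disease_b": ([0.7, 0.4, 0.6, 0.2], [1, 1, 0, 0]),
}

if __name__ == "__main__":
    for name, (scores, labels) in toy.items():
        print(name, auc(scores, labels))
    print("average", averaged_auc(toy))
```

A portability check of the kind described in the abstract would apply the same computation to scores produced at one site by a model fitted at the other; the pairwise formulation here is quadratic in sample size and is meant for illustration only.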