Gerard Burger1, Ameen Abu-Hanna2, Nicolette de Keizer2, Ronald Cornet3. 1. Symbiant Pathology Expert Centre, Hoorn, The Netherlands; Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands. 2. Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands. 3. Department of Medical Informatics, Academic Medical Center, University of Amsterdam, Amsterdam, The Netherlands; Department of Biomedical Engineering, Linköping University, Linköping, Sweden.
Abstract
BACKGROUND: Encoded pathology data are key for medical registries and analyses, but pathology information is often expressed as free text. OBJECTIVE: We reviewed and assessed the use of natural language processing (NLP) for encoding pathology documents. MATERIALS AND METHODS: Papers addressing NLP in pathology were retrieved from PubMed, the Association for Computing Machinery (ACM) Digital Library and the Association for Computational Linguistics (ACL) Anthology. We reviewed and summarised the study objectives; the NLP methods used and their validation; software implementations; performance on the datasets used; and any reported use in practice. RESULTS: The main objectives of the 38 included papers were encoding and extraction of clinically relevant information from pathology reports. Common approaches were word/phrase matching, probabilistic machine learning and rule-based systems. Five papers (13%) compared different methods on the same dataset. Four papers did not specify the method(s) used. Eighteen of the 26 studies that reported F-measure, recall or precision reported values above 0.9. Proprietary software was the most frequently mentioned category (14 studies); General Architecture for Text Engineering (GATE) was the most applied architecture overall. Practical system use was reported in four papers. Most papers validated results against expert annotation. CONCLUSIONS: A variety of NLP methods is used in pathology research, and good performance, that is, high precision and recall and high retrieval/removal rates, is reported for all of them. Lack of validation and of shared datasets precludes performance comparison. More comparative analysis and validation are needed to provide better insight into the performance and merits of these methods. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/
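The abstract reports F-measure, recall and precision as the dominant performance metrics among the reviewed studies. As a reminder of how these three relate, a minimal sketch; the counts below are made-up for illustration and are not taken from any reviewed study:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and the balanced F1 measure
    from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical extraction result: 95 true positives, 5 false positives,
# 5 false negatives -- the >0.9 range reported by 18 of 26 studies.
p, r, f = precision_recall_f1(95, 5, 5)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.95 0.95 0.95
```

Because F1 is a harmonic mean, a system cannot reach a high F-measure by trading off one of precision or recall against the other, which is why the review treats values above 0.9 as uniformly strong.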
Authors: Jason P Lott; Denise M Boudreau; Ray L Barnhill; Martin A Weinstock; Eleanor Knopp; Michael W Piepkorn; David E Elder; Steven R Knezevich; Andrew Baer; Anna N A Tosteson; Joann G Elmore Journal: JAMA Dermatol Date: 2018-01-01 Impact factor: 10.282
Authors: Jan U Becker; David Mayerich; Meghana Padmanabhan; Jonathan Barratt; Angela Ernst; Peter Boor; Pietro A Cicalese; Chandra Mohan; Hien V Nguyen; Badrinath Roysam Journal: Kidney Int Date: 2020-04-01 Impact factor: 10.612
Authors: John Harry Caufield; Dibakar Sigdel; John Fu; Howard Choi; Vladimir Guevara-Gonzalez; Ding Wang; Peipei Ping Journal: Cardiovasc Res Date: 2022-02-21 Impact factor: 13.081
Authors: Anthony Nguyen; John O'Dwyer; Thanh Vu; Penelope M Webb; Sharon E Johnatty; Amanda B Spurdle Journal: BMJ Open Date: 2020-06-11 Impact factor: 2.692