Irene Pérez-Díez [1,2], Raúl Pérez-Moraga [1,3], Adolfo López-Cerdán [1,2], Jose-Maria Salinas-Serrano [4], María de la Iglesia-Vayá [5,6,7]
1. FISABIO-CIPF Joint Research Unit in Biomedical Imaging. Fundació per al Foment de la Investigació Sanitària i Biomèdica (FISABIO), Av. de Catalunya 21, València, 46020, Spain.
2. Bioinformatics and Biostatistics Unit. Centro de Investigación Príncipe Felipe (CIPF), Carrer d'Eduardo Primo Yúfera 3, València, 46012, Spain.
3. ESI International Chair@CEU-UCH, Departamento de Matemáticas, Física y Ciencias Tecnológicas, Universidad Cardenal Herrera-CEU, CEU Universities, Calle San Bartolomé 55, Alfafara del Patriarca, 46115, Spain.
4. Health Informatics Department, Hospital San Juan de Alicante, Sant Joan d'Alacant, 03550, Spain.
5. FISABIO-CIPF Joint Research Unit in Biomedical Imaging. Fundació per al Foment de la Investigació Sanitària i Biomèdica (FISABIO), Av. de Catalunya 21, València, 46020, Spain.
6. Regional Ministry of Universal Health and Public Health in Valencia, Carrer de Misser Mascó 31, València, 46010, Spain.
7. CIBERSAM, ISCIII, Av. Blasco Ibáñez 15, València, 46010, Spain.
Correspondence: María de la Iglesia-Vayá, miglesia@cipf.es.
Abstract
BACKGROUND: Medical texts such as radiology reports and electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information about both patients and medical staff. Although several anonymization strategies exist for English, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that can be transferred to other languages. RESULTS: We tested four neural networks on our radiology report dataset, achieving a recall of 97.18% on the identifying entities. In addition, we developed a randomization algorithm that substitutes the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were then tested on the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%. CONCLUSIONS: The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it could be readily extended to other languages and to other medical texts, such as electronic health records.
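The abstract describes substituting each detected entity with a new one from the same category. The following is a minimal sketch of that idea only, not the authors' implementation: the category labels, the surrogate pools in `SURROGATES`, and the function name `randomize_entities` are all hypothetical, and the entity spans are assumed to come from an upstream named entity recognition model that is not shown here.

```python
import random

# Hypothetical surrogate pools. The source does not list its entity
# categories or replacement values; these are illustrative only.
SURROGATES = {
    "NAME": ["Ana García", "Luis Martínez", "Carmen Ruiz"],
    "HOSPITAL": ["Hospital A", "Hospital B"],
}


def randomize_entities(text, entities, rng):
    """Replace each detected span with a random surrogate of the same category.

    `entities` is a list of non-overlapping (start, end, category) tuples,
    with `end` exclusive, as an NER model might produce them.
    """
    pieces = []
    cursor = 0
    for start, end, category in sorted(entities):
        pieces.append(text[cursor:start])                  # untouched text before the entity
        pieces.append(rng.choice(SURROGATES[category]))    # same-category substitute
        cursor = end
    pieces.append(text[cursor:])                           # trailing text after the last entity
    return "".join(pieces)


if __name__ == "__main__":
    report = "Paciente: Juan Pérez. Centro: Hospital X."
    name_start = report.index("Juan Pérez")
    hosp_start = report.index("Hospital X")
    spans = [
        (name_start, name_start + len("Juan Pérez"), "NAME"),
        (hosp_start, hosp_start + len("Hospital X"), "HOSPITAL"),
    ]
    print(randomize_entities(report, spans, random.Random(0)))
```

Because the substitute belongs to the same category as the original, the output keeps the shape of a real report, which is what makes real and synthetic entities hard to tell apart. Passing the random generator in explicitly keeps the sketch reproducible for testing.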
Keywords:
Medical texts; Named entity recognition; Natural language processing; Radiology reports; Spanish