Sara Santiso (1), Alicia Pérez (2), Arantza Casillas (3). 1. IXA group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: sara.santiso@ehu.eus. 2. IXA group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: alicia.perez@ehu.eus. 3. IXA group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: arantza.casillas@ehu.eus.
Abstract
BACKGROUND AND OBJECTIVE: This work aims to extract Adverse Drug Reactions (ADRs), i.e. harm directly caused by a drug at normal doses, from Electronic Health Records (EHRs). The lack of readily available EHRs, owing to confidentiality issues, and their lexical variability make ADR extraction challenging. Furthermore, ADRs are rare events, so representations that are robust to data sparsity are needed. METHODS: Embedding-based characterizations are able to group semantically related words. However, dense spaces suffer from data sparsity. We employed context-aware continuous representations to enhance the modelling of infrequent events through their context, and we turned to simple smoothing techniques (e.g. direction cosines, truncation, Principal Component Analysis (PCA) and clustering) to increase the proximity between similar words in an attempt to cope with data sparsity. RESULTS: An F-measure of 0.639 was achieved for ADR classification, an improvement of approximately 0.300 over the results obtained with a word-based characterization. CONCLUSION: The embedding-based representation together with the smoothing techniques increased the robustness of the ADR characterization. It proved particularly appropriate for coping with lexical variability and data sparsity.
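To make the idea of smoothing a dense space more concrete, the following is a minimal sketch of two of the techniques named in the METHODS, direction cosines and truncation. The abstract does not describe how they were implemented, so everything here is an assumption: direction cosines are taken to mean scaling each vector to unit length, truncation is taken to mean keeping a fixed number of decimal places per component, and the function names and toy three-dimensional vectors are hypothetical.

```python
import math


def direction_cosines(vec):
    """Scale a vector to unit length.

    Each resulting component is the cosine of the angle between the
    original vector and the corresponding axis.
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave the zero vector unchanged
    return [x / norm for x in vec]


def truncate(vec, decimals=1):
    """Keep only `decimals` decimal places per component (toward zero).

    Assumed reading of "truncation": coarsening the space so that
    nearby vectors collapse onto the same point.
    """
    factor = 10 ** decimals
    return [math.trunc(x * factor) / factor for x in vec]


# Hypothetical toy embeddings for two related terms (not real data).
toy_embeddings = {
    "rash": [0.62, 0.11, -0.35],
    "eruption": [1.21, 0.27, -0.66],
}

# Apply both smoothing steps: normalise the length, then coarsen.
smoothed = {
    word: truncate(direction_cosines(vec), decimals=1)
    for word, vec in toy_embeddings.items()
}

for word, vec in smoothed.items():
    print(word, vec)
```

In this toy case the two vectors point in similar directions but differ in length, so after both steps they are mapped to the same point. This is the sense in which smoothing can increase the proximity between similar words, at the cost of discarding fine-grained detail.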