Roberto Álvarez Sánchez (1), Andoni Beristain Iraola (2), Gorka Epelde Unanue (2), Paul Carlin (3).
1. Vicomtech, Paseo Mikeletegi 57, Parque Científico y Tecnológico de Gipuzkoa, Donostia/San Sebastián 20009, Gipuzkoa, Spain; IIS Biodonostia, Paseo Doctor Beguiristain s/n, Donostia/San Sebastián, 20014, Gipuzkoa, Spain. Electronic address: ralvarez@vicomtech.org.
2. Vicomtech, Paseo Mikeletegi 57, Parque Científico y Tecnológico de Gipuzkoa, Donostia/San Sebastián 20009, Gipuzkoa, Spain; IIS Biodonostia, Paseo Doctor Beguiristain s/n, Donostia/San Sebastián, 20014, Gipuzkoa, Spain.
3. South Eastern Health and Social Care Trust, Upper Newtownards Road, Belfast, BT16 1RH, United Kingdom.
Abstract
BACKGROUND AND OBJECTIVES: Data curation is tedious but of paramount importance for data analytics, especially in the health context, where data-driven decisions must be extremely accurate. TAQIH aims to support non-technical users in (1) the exploratory data analysis (EDA) of tabular health data and (2) the assessment and improvement of its quality.

METHODS: A web-based tool has been implemented with a simple yet powerful visual interface. First, it provides interfaces for understanding the content, structure and distribution of the dataset. Then, it provides data visualization and improvement utilities for the data quality dimensions of completeness, accuracy, redundancy and readability.

RESULTS: The tool has been applied in two scenarios: (1) the Northern Ireland General Practitioners (GPs) Prescription Data, an open dataset containing drug prescriptions, and (2) a dataset from a glucose monitoring telehealth system. Findings on (1) include features with a significant share of missing values (e.g. the AMP_NM variable, 53.39%); instances with a high percentage of missing variable values (e.g. 0.21% of the instances have > 75% of their values missing); and highly correlated variables (e.g. Gross and Actual cost are almost completely correlated (∼ +1.0)). Findings on (2) include features with a significant share of missing values (e.g. patient height, weight and body mass index (BMI), > 70%; date of diagnosis, 13%) and highly correlated variables (e.g. height, weight and BMI). Full details of the testing and the insights related to the findings are reported.

CONCLUSIONS: TAQIH enables and supports users in carrying out EDA on tabular health data and in assessing and improving its quality. Arranging the application menu in the same order as the conventional EDA pipeline helps users follow a consistent analysis process. The section giving a general description of the dataset and its features is very useful for a first overview of the dataset. The missing value heatmap is also very helpful for visually identifying correlations among missing values. The correlations section, like the outliers section, has proved useful as a preliminary step before further data analysis pipelines. Finally, the data quality section provides a quantitative measure of the dataset improvements.