Carlos Sáez (1,2), Oscar Zurriaga (3,4,5), Jordi Pérez-Panadés (3), Inma Melchor (3), Montserrat Robles (6), Juan M García-Gómez (6,7).
1. Instituto Universitario de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain. carsaesi@ibime.upv.es.
2. Centre for Health Technologies and Services Research, University of Porto, Porto, Portugal.
3. Dirección General de Salud Pública, Conselleria de Sanidad, Valencia, Spain.
4. FISABIO - Salud Pública, Conselleria de Sanidad, Valencia, Spain.
5. CIBERESP, Madrid, Spain.
6. Instituto Universitario de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas, Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain.
7. Unidad Mixta de Investigación en TICs aplicadas a la Reingeniería de Procesos Sociosanitarios (eRPSS), Instituto de Investigación Sanitaria del Hospital Universitario y Politécnico La Fe, Valencia, Spain.
Abstract
OBJECTIVE: To assess the variability in data distributions among data sources and over time, through a case study of a large multisite repository, as a systematic approach to data quality (DQ).

MATERIALS AND METHODS: Novel probabilistic DQ control methods based on information theory and geometry are applied to the Public Health Mortality Registry of the Region of Valencia, Spain, with 512 143 entries from 2000 to 2012, disaggregated into 24 health departments. The methods provide DQ metrics and exploratory visualizations for (1) assessing the variability among multiple sources and (2) monitoring and exploring changes over time. The methods are suited to big data and to multitype, multivariate, and multimodal data.

RESULTS: The repository was partitioned into 2 probabilistically separated temporal subgroups following a change in the Spanish National Death Certificate in 2009. Isolated temporal anomalies were observed, caused by a one-off increase in missing data, along with outlying and clustered health departments attributable to differences in populations or in practices.

DISCUSSION: Changes in protocols, differences in populations, biased practices, or other systematic DQ problems affected data variability. Even if semantic and integration aspects are addressed in data sharing infrastructures, probabilistic variability may still be present. Solutions include fixing or excluding data and analyzing different sites or time periods separately. A systematic approach to assessing temporal and multisite variability is proposed.

CONCLUSION: Multisite and temporal variability in data distributions affects DQ and hinders data reuse; an assessment of such variability should be part of systematic DQ procedures.
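The idea of measuring variability among sources can be illustrated with a minimal sketch. The abstract does not name the specific distance or metrics used, so the choice of the Jensen-Shannon distance here is an assumption, as are the function names, the department labels, and the toy cause-of-death codes. The sketch estimates one categorical distribution per source, treats missing values as their own category, and computes pairwise distances between sources.

```python
import math
from collections import Counter
from itertools import combinations


def distribution(values, categories):
    """Relative frequency of each category; None (missing) is a category too."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(c, 0) / total for c in categories]


def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero mass contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def pairwise_distances(samples, categories):
    """Distance between the distributions of every pair of sources."""
    dists = {name: distribution(v, categories) for name, v in samples.items()}
    return {
        (a, b): js_distance(dists[a], dists[b])
        for a, b in combinations(sorted(dists), 2)
    }


# Hypothetical toy data: one list of cause-of-death codes per department.
samples = {
    "dept_A": ["I21", "C34", "I21", "J44", "C34", "I21"],
    "dept_B": ["I21", "C34", "I21", "J44", "C34", "I21"],
    "dept_C": [None, None, "I21", None, "C34", None],  # mostly missing
}
categories = ["I21", "C34", "J44", None]

distances = pairwise_distances(samples, categories)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

In this toy case the two departments with identical distributions are at distance 0, while the department dominated by missing values sits far from both. The same functions could be applied to consecutive time batches (for example, one distribution per month) to look for temporal changes, again only as an illustration of the general approach.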