OBJECTIVE: To present a data quality assurance program for disparate data sources loaded into a Common Data Model, and to highlight the data quality issues identified and the resolutions implemented. BACKGROUND: The Observational Medical Outcomes Partnership is conducting methodological research to develop a system to monitor drug safety. Standard processes and tools are needed to ensure continuous data quality across a network of disparate databases and to ensure that extract-transform-load (ETL) procedures maintain data integrity. Currently, there is no consensus or standard approach for evaluating the quality of the source data or of the ETL procedures. METHODS: We propose a framework for a comprehensive process to ensure data quality throughout the steps used to process and analyze the data. The approach used to manage data anomalies includes: (1) characterization of data sources; (2) detection of data anomalies; (3) determination of the cause of data anomalies; and (4) remediation. FINDINGS: Data anomalies included incomplete raw datasets (no race or year of birth recorded) and implausible data (year of birth exceeding the current year, observation period end date preceding the start date, and suspicious data frequencies and proportions outside the normal range). Errors found in the ETL process included zip codes incorrectly loaded, drug quantities rounded, drug exposure length incorrectly calculated, and condition length incorrectly programmed. CONCLUSIONS: Complete and reliable observational data are difficult to obtain. Because data are regularly updated, data quality assurance must be continuous; consequently, processes to assess data quality should be ongoing and transparent.
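The anomaly-detection step described in the abstract can be illustrated with a minimal Python sketch. It is not the study's actual tooling: the record layout (plain dictionaries), the field names (`race`, `year_of_birth`, `start_date`, `end_date`), the function names, and the sample records are all assumptions made up for illustration. It covers only three of the reported anomaly types: missing race or year of birth, year of birth exceeding the current year, and an observation period whose end date precedes its start date.

```python
from datetime import date


def check_person(person, current_year):
    """Return anomaly labels for one person record (a dict; hypothetical layout)."""
    issues = []
    if person.get("race") is None:
        issues.append("missing race")
    year_of_birth = person.get("year_of_birth")
    if year_of_birth is None:
        issues.append("missing year of birth")
    elif year_of_birth > current_year:
        issues.append("year of birth exceeds current year")
    return issues


def check_observation_period(period):
    """Return anomaly labels for one observation period record (hypothetical layout)."""
    issues = []
    if period["end_date"] < period["start_date"]:
        issues.append("observation period end date precedes start date")
    return issues


def build_report(persons, periods, current_year):
    """Collect anomaly labels per person_id; persons with no issues are omitted."""
    report = {}
    for person in persons:
        issues = check_person(person, current_year)
        if issues:
            report.setdefault(person["person_id"], []).extend(issues)
    for period in periods:
        issues = check_observation_period(period)
        if issues:
            report.setdefault(period["person_id"], []).extend(issues)
    return report


# Made-up sample records, used only to exercise the checks.
persons = [
    {"person_id": 1, "year_of_birth": 1950, "race": "white"},
    {"person_id": 2, "year_of_birth": 2150, "race": None},
    {"person_id": 3, "year_of_birth": None, "race": "asian"},
]
periods = [
    {"person_id": 1, "start_date": date(2005, 1, 1), "end_date": date(2006, 1, 1)},
    {"person_id": 2, "start_date": date(2007, 6, 1), "end_date": date(2007, 1, 1)},
]

report = build_report(persons, periods, current_year=2012)
for person_id, issues in sorted(report.items()):
    print(person_id, issues)
```

Passing `current_year` as a parameter, rather than reading the system clock inside the check, keeps the result reproducible when the same data are re-examined later. A check of this kind only flags records; determining the cause (source data versus ETL) and remediation remain separate steps in the framework.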