Matthias W Lorenz [1], Negin Ashtiani Abdi [2], Frank Scheckenbach [3], Anja Pflug [3], Alpaslan Bülbül [3], Alberico L Catapano [4,5], Stefan Agewall [6,7], Marat Ezhov [8], Michiel L Bots [9,10], Stefan Kiechl [11], Andreas Orth [2].
1. Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany. Matthias.lorenz@em.uni-frankfurt.de.
2. Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, Frankfurt/Main, Germany.
3. Department of Neurology, University Clinic Frankfurt, Schleusenweg 2-16, D-60528, Frankfurt/Main, Germany.
4. IRCSS Multimedica, Milan, Italy.
5. Department of Pharmacological and Biomolecular Sciences, University of Milan, Milan, Italy.
6. Institute of Clinical Sciences, University of Oslo, Oslo, Norway.
7. Department of Cardiology, Oslo University Hospital Ullevål, Oslo, Norway.
8. Atherosclerosis Department, Cardiology Research Center, Moscow, Russia.
9. University Medical Center Utrecht, Utrecht, The Netherlands.
10. Department of Epidemiology and Biostatistics, Erasmus Medical Center, Rotterdam, The Netherlands.
11. Department of Neurology, Medical University Innsbruck, Innsbruck, Austria.
Abstract
BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a short list of candidate source variables from which a user picks the matching one, with only a low risk that the correct source variable is missing from the list.
METHODS: For each variable in a set of target variables, a number of simple rules were created manually. Using logic regression, an optimal Boolean combination of these rules was searched for each target variable in a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal rule combinations were then validated in a second subset of the same database (validation subset).
RESULTS: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.
CONCLUSIONS: We demonstrated that applying logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
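The approach described under METHODS can be illustrated with a minimal sketch. Everything below is hypothetical: the rule names, the example variable names, and the labels are made up for illustration and are not taken from the study. The sketch also replaces logic regression's own search procedure with a brute-force enumeration of single rules, negations, and pairwise AND/OR combinations, so it only conveys the idea of scoring Boolean rule combinations by sensitivity, PPV, and NPV.

```python
from itertools import combinations

# Hypothetical hand-written rules for one target variable
# (say, "systolic blood pressure"); each rule inspects a source variable name.
RULES = {
    "has_sys": lambda name: "sys" in name.lower(),
    "has_bp": lambda name: "bp" in name.lower(),
    "has_dia": lambda name: "dia" in name.lower(),
}

# Made-up labelled sample: (source variable name, is it a true match?)
SAMPLE = [
    ("SYSBP", True),
    ("sbp_sys", True),
    ("bp_systolic", True),
    ("DIABP", False),
    ("diabetes", False),
    ("system_id", False),
    ("age", False),
]


def metrics(predict, sample):
    """Return sensitivity, PPV and NPV of a Boolean predictor on a sample."""
    tp = fp = tn = fn = 0
    for name, truth in sample:
        pred = bool(predict(name))
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
        "npv": tn / (tn + fn) if tn + fn else 0.0,
    }


def candidates(rules):
    """Yield (label, predictor) for literals and pairwise AND/OR combinations."""
    literals = {}
    for label, rule in rules.items():
        literals[label] = rule
        literals["not " + label] = (lambda r: lambda name: not r(name))(rule)
    yield from literals.items()
    for (a, ra), (b, rb) in combinations(literals.items(), 2):
        yield f"({a} and {b})", (lambda x, y: lambda n: x(n) and y(n))(ra, rb)
        yield f"({a} or {b})", (lambda x, y: lambda n: x(n) or y(n))(ra, rb)


def best_combination(rules, sample):
    """Pick the candidate with the highest sensitivity, breaking ties by PPV."""
    def score(candidate):
        m = metrics(candidate[1], sample)
        return (m["sensitivity"], m["ppv"])
    return max(candidates(rules), key=score)


if __name__ == "__main__":
    label, rule = best_combination(RULES, SAMPLE)
    print(label, metrics(rule, SAMPLE))
```

Ranking by sensitivity first mirrors the semi-automation argument in the BACKGROUND: a candidate list that rarely misses the correct source variable is useful even when its PPV is modest, because a user makes the final choice. In practice the selected combination would be re-scored on a separate validation subset, as the abstract describes.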
Keywords: Data management; Epidemiology; Logic regression; Meta-analysis