Methods for enhancing the reproducibility of biomedical research findings using electronic health records
Spiros Denaxas, Kenan Direk, Arturo Gonzalez-Izquierdo, Maria Pikoula, Aylin Cakiroglu, Jason Moore, Harry Hemingway, Liam Smeeth.
Abstract
BACKGROUND: The ability of external investigators to reproduce published scientific findings is critical for the evaluation and validation of biomedical research by the wider community. However, a substantial proportion of health research using electronic health records (EHR), data collected and generated during clinical care, is potentially not reproducible, largely because the implementation details of most data preprocessing, cleaning, phenotyping and analysis approaches are not systematically made available or shared. With the complexity, volume and variety of electronic health record data sources made available for research steadily increasing, it is critical to ensure that scientific findings from EHR data are reproducible and replicable by researchers. Reporting guidelines, such as RECORD and STROBE, have set a solid foundation by recommending a series of items for researchers to include in their research outputs. Researchers, however, often lack the technical tools and methodological approaches to actuate such recommendations in an efficient and sustainable manner.
Keywords: Biomedical research; Electronic health records; Reproducibility; Transparency
Year: 2017 PMID: 28912836 PMCID: PMC5594436 DOI: 10.1186/s13040-017-0151-7
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Analytical challenges associated with using big health data for biomedical research span the methodological, ethical, analytical and translational domains
Fig. 2 A generic EHR analytical pipeline can generally be split into several smaller distinct stages which are often executed iteratively: 1) raw EHR data are pre-processed, linked and transformed into statistically analyzable datasets; 2) data undergo statistical analyses; and 3) scientific findings are presented and disseminated as data, figures, narrative and tables in scientific outputs
REporting of studies Conducted using Observational Routinely collected Data (RECORD) recommendations on reporting details around EHR algorithms used to define the study populations, exposures and outcomes
| RECORD guideline principle ID number | Description |
|---|---|
| 6.1 | The methods of study population selection (such as codes or algorithms used to identify subjects) should be listed in detail. |
| 7.1 | A complete list of codes and algorithms used to classify exposures, outcomes, confounders, and effect modifiers should be provided. |
| 13.1 | Describe in detail the selection of the persons included in the study (i.e., study population selection) including filtering based on data quality, data availability and linkage. |
| 22.1 | Authors should provide information on how to access any supplemental information such as the study protocol, raw data or programming code. |
Methods and approaches that can enable the reproducibility of biomedical research findings using electronic health records
| Method/approach | Recommendations |
|---|---|
| Scientific software engineering principles | Create generic functions for common EHR data cleaning and preprocessing operations which can be shared with the community |
| | Produce functions for defining study exposures, covariates and clinical outcomes across datasets which can be maintained across research groups and reused in many research studies |
| | Create modules that logically group common EHR operations (e.g. study population definitions or data source manipulation) to enable code maintainability |
| | Create tests for individual functions and modules to ensure the robustness and correctness of results |
| | Track changes in analytical code and phenotype definitions (using controlled clinical terminology terms) with a source code revision control system |
| | Use formal software engineering best practices to document workflows and data manipulation operations |
| Standardized analytical approaches | Build and distribute libraries for common EHR data manipulation or statistical analysis and include sufficient detail (e.g. command line arguments) for all tools used |
| | Produce and annotate machine-readable EHR phenotyping algorithms that can be systematically curated and reused by the community |
| | Use Digital Object Identifiers (DOIs) to transform research artifacts into shareable, citable resources and cross-reference them from research outputs |
| | Deposit research resources (e.g. algorithms, code) in open-access repositories or scientific software journals and cross-reference them from research outputs |
| | Use virtual machines to encapsulate the data, operating system, analytical software and algorithms used to generate a manuscript and, where applicable, make them available for others to reproduce the analytical pipeline |
| Literate programming | Encapsulate both logic and programming code using literate programming approaches and tools which ensure logic and underlying processing code coexist |
Example of an R function for converting lipid measurements between mmol/L and mg/dL units
Function arguments (value and units) are validated prior to performing the calculation and an error is raised if incorrect or missing parameters are supplied
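The figure itself is not reproduced here; below is a minimal R sketch of the kind of function described, assuming the total-cholesterol conversion factor of 38.67 (the factor differs for other lipids such as triglycerides). The function and argument names are illustrative, not taken from the paper.

```r
# Minimal sketch of the described conversion function (illustrative names).
# For total cholesterol: 1 mmol/L = 38.67 mg/dL.
convert_lipid <- function(value, unit) {
  # validate arguments before performing the calculation
  if (missing(value) || missing(unit)) {
    stop("both 'value' and 'unit' must be supplied")
  }
  if (!is.numeric(value) || length(value) != 1 || value < 0) {
    stop("'value' must be a single non-negative number")
  }
  if (identical(unit, "mmol/L")) {
    value * 38.67        # mmol/L -> mg/dL
  } else if (identical(unit, "mg/dL")) {
    value / 38.67        # mg/dL -> mmol/L
  } else {
    stop("'unit' must be 'mmol/L' or 'mg/dL'")
  }
}

convert_lipid(5.2, "mmol/L")  # ~201.1 mg/dL
```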
Fig. 3 Simple example of a Unified Modelling Language (UML) class diagram. Class diagrams are static representations that describe the structure of a system by showing its classes and the relationships between them. Classes are represented as boxes with three sections: the first contains the name of the class, the second its attributes and the third its methods. Class diagrams also illustrate the relationships (and their multiplicity) between classes. In this instance, a patient can be assigned to a single ward within a hospital, whereas a ward can have multiple patients admitted at any time (depicted as 1..*)
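As a rough illustration, the Patient/Ward relationship in the diagram could be expressed in R's S4 class system; the class and slot names below are assumptions for illustration, not taken from the figure.

```r
# Illustrative S4 sketch of the Patient/Ward relationship (assumed names).
setClass("Patient", slots = c(id = "character", name = "character"))
setClass("Ward",    slots = c(name = "character", patients = "list"))

# A ward admits multiple patients (the 1..* multiplicity), while each
# patient is assigned to exactly one ward.
p1 <- new("Patient", id = "p001", name = "Patient A")
p2 <- new("Patient", id = "p002", name = "Patient B")
ward <- new("Ward", name = "Cardiology", patients = list(p1, p2))
length(ward@patients)  # 2
```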
Using the RUnit library to perform unit tests for a function converting measurements of lipids from mmol/L to mg/dL
Combinations of valid, invalid, and missing function parameters are tested and the output returned from the function is examined
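A minimal RUnit sketch of such tests, written against the convert_lipid function sketched earlier; the test name and tolerance are illustrative.

```r
# Illustrative RUnit tests for the conversion function sketched above.
library(RUnit)

test.convert_lipid <- function() {
  # valid input: 1 mmol/L of total cholesterol is ~38.67 mg/dL
  checkEqualsNumeric(convert_lipid(1, "mmol/L"), 38.67, tolerance = 1e-6)
  checkEqualsNumeric(convert_lipid(38.67, "mg/dL"), 1, tolerance = 1e-6)
  # invalid parameters should raise errors
  checkException(convert_lipid(-1, "mmol/L"))
  checkException(convert_lipid(5.2, "stones"))
  # missing parameters should also raise errors
  checkException(convert_lipid(5.2))
}

test.convert_lipid()
```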
Example of using git to initialize an empty repository and track changes in a versioned file defining a study cohort
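A minimal sketch of the git workflow described; the file name and commit messages are illustrative.

```bash
git init                                   # initialize an empty repository
git add cohort-definition.R                # start tracking the cohort file
git commit -m "Initial study cohort definition"
# ...edit the cohort definition, then record the change:
git add cohort-definition.R
git commit -m "Refine inclusion criteria"
git log --oneline                          # inspect the change history
```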
Fig. 4 Example of an algorithm managed by version control software. The master version of the algorithm is located on the main development line, often called the trunk or master branch (shown in green). An individual refinement branch, which can be worked on without affecting the main version, is eventually merged back into the main development line [95]
Fig. 5 Example of using the Knitr R package to produce a dynamic report with embedded R code and results, including a plot. Documentation and data-processing code chunks are written in plain text in a file that is processed as RMarkdown. At the top of the file, a series of key: value pairs in YAML set document metadata such as the title and the output format. Code chunks are enclosed between ``` characters and executed when the document is compiled. Chunk parameters such as echo can be set to specify whether the results of executing the code, the code itself, or both are displayed. Example taken from http://jupyter.org/
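A minimal RMarkdown sketch of such a dynamic document; the title, chunk name and plotted quantities are illustrative assumptions.

````markdown
---
title: "Lipid conversion report"
output: html_document
---

Converting total cholesterol from mmol/L to mg/dL; the chunk below is
executed when the document is knitted and its plot embedded in the output.

```{r conversion-plot, echo=TRUE}
mmol <- seq(2, 10, by = 0.5)
mgdl <- mmol * 38.67  # total cholesterol conversion factor
plot(mmol, mgdl, type = "l",
     xlab = "Total cholesterol (mmol/L)",
     ylab = "Total cholesterol (mg/dL)")
```
````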
Examples of packages and libraries supporting literate programming and report generation in popular analytical/statistical software packages
| Statistical/Analytical tool | Relevant packages |
|---|---|
| R | RMarkdown, Knitr, Sweave, Roxygen |
| Stata | MarkDoc, Weaver, Ketchup |
| Python | Jupyter Notebook, Doxygen |
| Matlab | Doxygen (limited) |
| Octave | Doxygen |
| SAS | SASWeave |