Shirley V Wang1,2, Sebastian Schneeweiss1,2, Marc L Berger3, Jeffrey Brown4, Frank de Vries5, Ian Douglas6, Joshua J Gagne1,2, Rosa Gini7, Olaf Klungel8, C Daniel Mullins9, Michael D Nguyen10, Jeremy A Rassen11, Liam Smeeth6, Miriam Sturkenboom12.
Abstract
PURPOSE: Defining a study population and creating an analytic dataset from longitudinal healthcare databases involves many decisions. Our objective was to catalogue scientific decisions underpinning study execution that should be reported to facilitate replication and enable assessment of validity of studies conducted in large healthcare databases.
Keywords: Transparency; healthcare databases; longitudinal data; methods; pharmacoepidemiology; replication; reproducibility
Year: 2017 PMID: 28913963 PMCID: PMC5639362 DOI: 10.1002/pds.4295
Source DB: PubMed Journal: Pharmacoepidemiol Drug Saf ISSN: 1053-8569 Impact factor: 2.890
Figure 1. Data provenance: transitions from healthcare delivery to analysis results. [Colour figure can be viewed at wileyonlinelibrary.com]
Reproducibility and replicability
Reporting specific parameters to increase reproducibility of database studies*
| Parameter | Description | Example | Synonyms |
|---|---|---|---|
|
| |||
| A.1 Data provider | Data source name and name of organization that provided data. | Medicaid Analytic Extracts data covering 50 states from the Centers for Medicare and Medicaid Services. | |
| A.2 Data extraction date (DED) | The date (or version number) when data were extracted from the dynamic raw transactional data stream (e.g. date that the data were cut for research use by the vendor). | The source data for this research study was cut by [data vendor] on January 1st, 2017. The study included administrative claims from Jan 1st 2005 to Dec 31st 2015. | Data version, data pull |
| A.3 Data sampling | The search/extraction criteria applied if the source data accessible to the researcher is a subset of the data available from the vendor. | ||
| A.4 Source data range (SDR) | The calendar time range of data used for the study. Note that the implemented study may use only a subset of the available data. | | Study period, query period |
| A.5 Type of data | The domains of information available in the source data, e.g. administrative, electronic health records, inpatient versus outpatient capture, primary vs secondary care, pharmacy, lab, registry. | The administrative claims data include enrollment information, inpatient and outpatient diagnosis (ICD9/10) and procedure (ICD9/10, CPT, HCPCS) codes as well as outpatient dispensations (NDC codes) for 60 million lives covered by Insurance X. The electronic health records data include diagnosis and procedure codes from billing records, problem list entries, vital signs, prescription and laboratory orders, laboratory results, inpatient medication dispensation, as well as unstructured text found in clinical notes and reports for 100,000 patients with encounters at ABC integrated healthcare system. | |
| A.6 Data linkage, other supplemental data | Data linkage or supplemental data such as chart reviews or survey data not typically available with license for healthcare database. | We used Surveillance, Epidemiology, and End Results (SEER) data on prostate cancer cases from 1990 through 2013 linked to Medicare and a 5% sample of Medicare enrollees living in the same regions as the identified cases of prostate cancer over the same period of time. The linkage was created through a collaborative effort from the National Cancer Institute (NCI), and the Centers for Medicare and Medicaid Services (CMS). | |
| A.7 Data cleaning | Transformations to the data fields to handle missing, out of range values or logical inconsistencies. This may be at the data source level or the decisions can be made on a project specific basis. | Global cleaning: The data source was cleaned to exclude all individuals who had more than one gender reported. All dispensing claims that were missing day's supply or had 0 days’ supply were removed from the source data tables. Project specific cleaning: When calculating duration of exposure for our study population, we ignored dispensation claims that were missing or had 0 days’ supply. We used the most recently reported birth date if there was more than one birth date reported. | |
| A.8 Data model conversion | Format of the data, including description of decisions used to convert data to fit a Common Data Model (CDM). | The source data were converted to fit the Sentinel Common Data Model (CDM) version 5.0. Data conversion decisions can be found on our website (http://ourwebsite). Observations with missing or out of range values were not removed from the CDM tables. | |
|
| |||
| B.1 Design diagram | A figure that contains 1st and 2nd order temporal anchors and depicts their relation to each other. | See example Figure | |
|
| |||
| C.1 Study entry date (SED) | The date(s) when subjects enter the cohort. | We identified the first SED for each patient. Patients were included if all other inclusion/exclusion criteria were met at the first SED. We identified all SED for each patient. Patients entered the cohort only once, at the first SED where all other inclusion/exclusion criteria were met. We identified all SED for each patient. Patients entered the cohort at every SED where all other inclusion/exclusion criteria were met. | Index date, cohort entry date, outcome date, case date, qualifying event date, sentinel event |
| C.2 Person or episode level study entry | The type of entry to the cohort. For example, at the individual level (1x entry only) or at the episode level (multiple entries, each time inclusion/exclusion criteria met). | | Single vs multiple entry, treatment episodes, drug eras |
| C.3 Sequencing of exclusions | The order in which exclusion criteria are applied, specifically whether they are applied before or after the selection of the SED(s). | | Attrition table, flow diagram, CONSORT diagram |
| C.4 Enrollment window (EW) | The time window prior to SED in which an individual was required to be contributing to the data source. | Patients entered the cohort on the date of their first dispensation for Drug X or Drug Y after at least 180 days of continuous enrollment (30 day gaps allowed) without dispensings for either Drug X or Drug Y. | Observation window |
| C.5 Enrollment gap | The algorithm for evaluating enrollment prior to SED including whether gaps were allowed. | ||
| C.6 Inclusion/Exclusion definition window | The time window(s) over which inclusion/exclusion criteria are defined. | Exclude from cohort if ICD‐9 codes for deep vein thrombosis (451.1x, 451.2x, 451.81, 451.9x, 453.1x, 453.2x, 453.8x, 453.9x, 453.40, 453.41, 453.42 where x represents presence of a numeric digit 0‐9 or no additional digits) were recorded in the primary diagnosis position during an inpatient stay within the 30 days prior to and including the SED. Invalid ICD‐9 codes that matched the wildcard criteria were excluded. | |
| C.7 Codes | The exact drug, diagnosis, procedure, lab or other codes used to define inclusion/exclusion criteria. | | Concepts, vocabulary, class, domain |
| C.8 Frequency and temporality of codes | The temporal relation of codes in relation to each other as well as the SED. When defining temporality, be clear whether or not the SED is included in assessment windows (e.g. occurred on the same day, 2 codes for A occurred within 7 days of each other during the 30 days prior to and including the SED). | ||
| C.9 Diagnosis position (if relevant/available) | The restrictions on codes to certain positions, e.g. primary vs. secondary diagnoses. | ||
| C.10 Care setting | The restrictions on codes to those identified from certain settings, e.g. inpatient, emergency department, nursing home. | | Care site, place of service, point of service, provider type |
| C.11 Washout for exposure | The period used to assess whether exposure at the end of the period represents new exposure. | New initiation was defined as the first dispensation for Drug X after at least 180 days without dispensation for Drug X, Y, and Z. | Lookback for exposure, event free period |
| C.12 Washout for outcome | The period prior to SED or ED to assess whether an outcome is incident. | Patients were excluded if they had a stroke within 180 days prior to and including the cohort entry date. Cases of stroke were excluded if there was a recorded stroke within 180 days prior. | Lookback for outcome, event free period |
|
| |||
| D.1 Type of exposure | The type of exposure that is captured or measured, e.g. drug versus procedure, new use, incident, prevalent, cumulative, time‐varying. | We evaluated risk of outcome Z following incident exposure to drug X or drug Y. Incident exposure was defined as beginning on the day of the first dispensation for one of these drugs after at least 180 days without dispensations for either (SED). Patients with incident exposure to both drug X and drug Y on the same SED were excluded. The exposure risk window for patients with Drug X and Drug Y began 10 days after incident exposure and continued until 14 days past the last days supply, including refills. If a patient refilled early, the date of the early refill and subsequent refills were adjusted so that the full days supply from the initial dispensation was counted before the days supply from the next dispensation was tallied. Gaps of less than or equal to 14 days in between one dispensation plus days supply and the next dispensation for the same drug were bridged (i.e. the time was counted as continuously exposed). If patients exposed to Drug X were dispensed Drug Y or vice versa, exposure was censored. NDC codes used to define incident exposure to drug X and drug Y can be found in the appendix. Drug X was defined by NDC codes listed in the appendix. Brand and generic versions were used to define Drug X. Non pill or tablet formulations and combination pills were excluded. | |
| D.2 Exposure risk window (ERW) | The ERW is specific to an exposure and the outcome under investigation. For drug exposures, it is equivalent to the time between the minimum and maximum hypothesized induction time following ingestion of the molecule. | | Drug era, risk window |
| D.2a Induction period | Days on or following study entry date during which an outcome would not be counted as "exposed time" or "comparator time". | | Blackout period |
| D.2b Stockpiling | The algorithm applied to handle leftover days supply if there are early refills. | ||
| D.2c Bridging exposure episodes | The algorithm applied to handle gaps that are longer than expected if there was perfect adherence (e.g. non‐overlapping dispensation + day's supply). | | Episode gap, grace period, persistence window, gap days |
| D.2d Exposure extension | The algorithm applied to extend exposure past the days supply for the last observed dispensation in a treatment episode. | | Event extension |
| D.3 Switching/add on | The algorithm applied to determine whether exposure should continue if another exposure begins. | | Treatment episode truncation indicator |
| D.4 Codes, frequency and temporality of codes, diagnosis position, care setting | Description in Section C. | | Concepts, vocabulary, class, domain, care site, place of service, point of service, provider type |
| D.5 Exposure assessment window (EAW) | A time window during which the exposure status is assessed. Exposure is defined at the end of the period. If the occurrence of exposure defines cohort entry, e.g. new initiator, then the EAW may be a point in time rather than a period. If EAW is after cohort entry, FW must begin after EAW. | We evaluated the effect of treatment intensification vs no intensification following hospitalization on disease progression. Study entry was defined by the discharge date from the hospital. The exposure assessment window started from the day after study entry and continued for 30 days. During this period, we identified whether or not treatment intensified for each patient. Intensification during this 30 day period determined exposure status during follow up. Follow up for disease progression began 31 days following study entry and continued until the first censoring criterion was met. | |
|
| |||
| E.1 Follow‐up window (FW) | The time following cohort entry during which patients are at risk to develop the outcome due to the exposure. FW is based on a biologic exposure risk window defined by minimum and maximum induction times. However, FW also accounts for censoring mechanisms. | Follow up began on the SED and continued until the earliest of discontinuation of study exposure, switching/adding comparator exposure, entry to nursing home, death, or end of study period. We included a biologically plausible induction period; therefore, follow up began 60 days after the SED and continued until the earliest of discontinuation of study exposure, switching/adding comparator exposure, entry to nursing home, death, or end of study period. | |
| E.2 Censoring criteria | The criteria that censor follow up. | ||
|
| |||
| F.1 Event date (ED) | The date of an event occurrence. | The ED was defined as the date of first inpatient admission with primary diagnosis 410.x1 after the SED and occurring within the follow up window. | Case date, measure date, observation date |
| F.2 Codes, frequency and temporality of codes, diagnosis position, care setting | Description in Section C. | | Concepts, vocabulary, class, domain, care site, place of service, point of service, provider type |
| F.3 Validation | The performance characteristics of outcome algorithm if previously validated. | The outcome algorithm was validated via chart review in a population of diabetics from data source D (citation). The positive predictive value of the algorithm was 94%. | |
|
| Event measures, observations | ||
| G.1 Covariate assessment window (CW) | The time over which patient covariates are assessed. | We assessed covariates during the 180 days prior to but not including the SED. | Baseline period |
| G.2 Comorbidity/risk score | The components and weights used in calculation of a risk score. | See appendix for example. Note that codes, temporality, diagnosis position and care setting should be specified for each component when applicable. | |
| G.3 Healthcare utilization metrics | The counts of encounters or orders over a specified time period, sometimes stratified by care setting, or type of encounter/order. | We counted the number of generics dispensed for each patient in the CAP. We counted the number of dispensations for each patient in the CAP. We counted the number of outpatient encounters recorded in the CAP. We counted the number of days with outpatient encounters recorded in the CAP. We counted the number of inpatient hospitalizations in the CAP; if admission and discharge dates for different encounters overlapped, these were "rolled up" and counted as 1 hospitalization. | |
| G.4 Codes, frequency and temporality of codes, diagnosis position, care setting | Description in Section C. | Baseline covariates were defined by codes from claims with service dates within 180 days prior to and including the SED. Major upper gastrointestinal bleeding was defined as inpatient hospitalization with: At least one of the following ICD‐9 diagnoses: 531.0x, 531.2x, 531.4x, 531.6x, 532.0x, 532.2x, 532.4x, 532.6x, 533.0x, 533.2x, 533.4x, 533.6x, 534.0x, 534.2x, 534.4x, 534.6x, 578.0 ‐ OR ‐ An ICD‐9 procedure code of: 44.43 ‐ OR ‐ A CPT code 43255 | Concepts, vocabulary, class, domain, care site, place of service, point of service, provider type |
|
| |||
| H.1 Sampling strategy | The strategy applied to sample controls for identified cases (patients with ED meeting all inclusion/exclusion criteria). | We used risk set sampling without replacement to identify controls from our cohort of patients with diagnosed diabetes (inpatient or outpatient ICD‐9 diagnoses of 250.xx in any position). Up to 4 controls were randomly matched to each case on length of time since SED (in months), year of birth and gender. The random seed and sampling code can be found in the online appendix. | |
| H.2 Matching factors | The characteristics used to match controls to cases. | ||
| H.3 Matching ratio | The number of controls matched to cases (fixed or variable ratio). | ||
|
| |||
| I.1 Statistical software program used | The software package, version, settings, packages or analytic procedures. | We used: SAS 9.4 PROC LOGISTIC; CRAN R v3.2.1 survival package; Sentinel's Routine Querying System version 2.1.1 CIDA+PSM. | |
Parameters in bold are key temporal anchors
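The stockpiling, bridging, and exposure extension parameters (D.2b to D.2d) interact, so reporting them as explicit code can remove ambiguity. The following is a minimal Python sketch of one possible reading of the D.1 example (early refills shifted forward, gaps of 14 days or less bridged, exposure extended 14 days past the last days supply). The function name `build_exposure_episodes`, its parameters, and the day-counting conventions are illustrative assumptions, not the authors' implementation; the induction period and censoring for switching are deliberately left out.

```python
from datetime import date, timedelta


def build_exposure_episodes(dispensings, bridge_days=14, extension_days=14):
    """Collapse one patient's dispensings of one drug into exposure episodes.

    dispensings: iterable of (dispense_date, days_supply) pairs.
    Returns a list of (episode_start, episode_end) dates, both inclusive.

    Assumed conventions (hypothetical, chosen for illustration):
    - supply covers dispense_date through dispense_date + days_supply - 1
    - an early refill starts counting the day after the prior supply ends
    - a gap of <= bridge_days uncovered days is counted as exposed
    - each episode ends extension_days after the last covered day
    """
    episodes = []
    start = covered_through = None
    for disp_date, days_supply in sorted(dispensings, key=lambda d: d[0]):
        if not days_supply or days_supply <= 0:
            continue  # ignore missing or zero days supply (cf. A.7 cleaning)
        if start is not None:
            # uncovered days between end of prior supply and this dispensing
            gap = (disp_date - covered_through).days - 1
            if gap <= bridge_days:
                # negative gap = early refill (stockpiling); small gap = bridged
                supply_start = max(disp_date, covered_through + timedelta(days=1))
                covered_through = supply_start + timedelta(days=days_supply - 1)
                continue
            # gap too long: close the current episode, applying the extension
            episodes.append((start, covered_through + timedelta(days=extension_days)))
        start = disp_date
        covered_through = disp_date + timedelta(days=days_supply - 1)
    if start is not None:
        episodes.append((start, covered_through + timedelta(days=extension_days)))
    return episodes
```

Under these assumptions, four 30-day dispensings on 1 Jan, 25 Jan, 10 Mar, and 1 Jun 2015 would yield two episodes: the early refill is stockpiled, the 8-day gap before 10 Mar is bridged, and the long gap before 1 Jun starts a new episode. Changing any one convention (for example, whether the extension applies before or after bridging) changes the person-time, which is why each choice belongs in the report.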
Key temporal anchors in design of a database study 1
| Temporal Anchors | Description |
|---|---|
|
| |
| Data Extraction Date ‐ DED | The date when the data were extracted from the dynamic raw transactional data stream |
| Source Data Range ‐ SDR | The calendar time range of data used for the study. Note that the implemented study may use only a subset of the available data. |
|
| |
| Study Entry Date ‐ SED | The dates when subjects enter the study. |
|
| |
| Enrollment Window ‐ EW | The time window prior to SED in which an individual was required to be contributing to the data source |
| Covariate Assessment Window ‐ CW | The time during which all patient covariates are assessed. Baseline covariate assessment should precede cohort entry in order to avoid adjusting for causal intermediates. |
| Follow‐Up Window ‐ FW | The time following cohort entry during which patients are at risk to develop the outcome due to the exposure. |
| Exposure Assessment Window ‐ EAW | The time window during which the exposure status is assessed. Exposure is defined at the end of the period. If the occurrence of exposure defines cohort entry, e.g. new initiator, then the exposure assessment may be a point in time rather than a window. If exposure assessment is after cohort entry, follow up must begin after exposure assessment. |
| Event Date ‐ ED | The date of an event occurrence following cohort entry |
| Washout for Exposure ‐ WE | The time prior to cohort entry during which there should be no exposure (or comparator). |
| Washout for Outcome ‐ WO | The time prior to cohort entry during which the outcome of interest should not occur |
Anchor dates are key dates; baseline anchors identify the available source data; first order anchor dates define entry to the analytic dataset, and second order anchors are relative to the first order anchor
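Because second order anchors are defined relative to the SED, an enrollment window (EW) with an allowable enrollment gap can be expressed as a small date calculation. The sketch below is in Python and uses the 180 day window and 30 day gap from the example in the reporting table as defaults. The function name `meets_enrollment_window`, the choice to end the window on the day before the SED, and the treatment of a gap at the start of the window like any other gap are all assumptions made for illustration; a real study would need to report its own choices.

```python
from datetime import date, timedelta


def meets_enrollment_window(enrollment_spans, sed, window_days=180, max_gap_days=30):
    """Check continuous enrollment during the window before the study entry date.

    enrollment_spans: iterable of (start, end) enrollment dates, both inclusive.
    sed: study entry date.

    Assumed conventions (hypothetical, chosen for illustration):
    - the window runs from sed - window_days through sed - 1
    - gaps of at most max_gap_days unenrolled days are ignored,
      including a gap at the very start of the window
    - the patient must be enrolled on the last day of the window
    """
    window_start = sed - timedelta(days=window_days)
    window_end = sed - timedelta(days=1)
    # last day accounted for so far; starts just before the window opens
    covered_through = window_start - timedelta(days=1)
    for start, end in sorted(enrollment_spans):
        if end < window_start or start > window_end:
            continue  # span lies entirely outside the window
        gap = (start - covered_through).days - 1  # unenrolled days before this span
        if gap > max_gap_days:
            return False
        covered_through = max(covered_through, end)
        if covered_through >= window_end:
            return True
    return False
```

For instance, with an SED of 1 July 2015, a patient whose enrollment lapses from 1 April to 20 April 2015 (20 days) would qualify under these defaults, while a lapse from 1 April to 14 May 2015 (44 days) would not. Whether the SED itself is inside the window is exactly the kind of detail the table asks authors to state.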
Figure 2. Example design diagram. [Colour figure can be viewed at wileyonlinelibrary.com]