
Random control selection for conducting high-throughput adverse drug events screening using large-scale longitudinal health data.

Chien-Wei Chiang¹, Penyue Zhang², Macarius Donneyong³, You Chen⁴, Yu Su⁵, Lang Li¹.

Abstract

A case-control design-based high-throughput pharmacoinformatics study using large-scale longitudinal health data can detect new adverse drug event (ADE) signals. Existing control selection approaches for the case-control design include the dynamic and super control selection approaches. The dynamic/super control selection approach requires all individuals to be evaluated at every ADE case index date, because an individual's eligibility as a control depends on ADE/enrollment history. Thus, with large-scale longitudinal health data, the dynamic/super control selection approach requires extraordinarily high computational time. We proposed a random control selection approach in which ADE case index dates were paired with randomly generated control index dates. The random control selection approach does not depend on ADE/enrollment history. It significantly reduces the computational time to prepare case-control data sets, as it requires all individuals to be evaluated only once. We compared the performance metrics of all control selection approaches using two large-scale longitudinal health data sets and a drug-ADE gold standard including 399 drug-ADE pairs. The F-scores for the random control selection approach were between 0.586 and 0.600, compared to between 0.545 and 0.562 for the dynamic/super control selection approaches. The random control selection approach was ~1000 times faster than the dynamic/super control selection approach in preparing case-control data sets. With large-scale longitudinal health data, a case-control design-based pharmacoinformatics study using random control selection is able to generate ADE signals comparable to those from the existing control selection approaches. The random control selection approach also significantly reduces the computational time to prepare the case-control data sets.

Year: 2021; PMID: 34313404; PMCID: PMC8452297; DOI: 10.1002/psp4.12673

Source DB: PubMed; Journal: CPT Pharmacometrics Syst Pharmacol; ISSN: 2163-8306

WHAT IS THE CURRENT KNOWLEDGE ON THE TOPIC? Large‐scale longitudinal health data and adverse drug event (ADE) phenotyping algorithms have become increasingly available. Traditional methods are highly computationally intensive when used for high‐throughput ADE screening with large‐scale longitudinal health data.

WHAT QUESTION DID THIS STUDY ADDRESS? We propose a computationally efficient control selection approach for case‐control design‐based high‐throughput ADE screening using large‐scale longitudinal health data.

WHAT DOES THIS STUDY ADD TO OUR KNOWLEDGE? A case‐control design‐based pharmacoinformatics study using randomly selected index dates as controls (i.e., random control selection) has comparable or higher performance metrics compared with existing control selection approaches, and the random control selection approach significantly reduces the data preparation time.

HOW MIGHT THIS CHANGE DRUG DISCOVERY, DEVELOPMENT, AND/OR THERAPEUTICS? Using the random control selection approach, a case‐control design‐based pharmacoinformatics study can be upscaled to screen several hypotheses in a short period of time (e.g., 15 min), and identify single drugs and drug combinations with increased ADE risks.

INTRODUCTION

Adverse drug events (ADEs), the unintended pharmacological consequences of correctly administered drugs, are a significant challenge for healthcare practice. Currently, in the United States, ADEs cause ~125,000 hospital admissions each year, complicate 53% of hospital stays, and cause up to 4.6% of deaths. Many serious ADEs cannot be detected prior to drug approval. For instance, in the United States, the times from approval to withdrawal due to safety concerns were 3.4 years for valdecoxib, 4.7 years for tegaserod, and 5.4 years for efalizumab. Traditionally, a pharmacoepidemiological study has been used to investigate a prespecified ADE hypothesis from real‐world health data. For instance, a pharmacoepidemiological study can be driven by suspicious ADE case reports. Unlike a pharmacoepidemiological study, a pharmacoinformatic study is not driven by any prespecified hypothesis; it is a discovery‐driven approach that screens signals from a large number of drug‐ADE pairs. Thus, a pharmacoinformatic study is able to generate ADE hypotheses for subsequent pharmacoepidemiological studies, and hence accelerates translational ADE research. Currently, regulatory agencies collect postmarket ADE reports through Spontaneous Reporting Systems (SRSs) to identify ADE hypotheses. An SRS report usually includes drug usage, ADE outcomes, and other information (e.g., patient demographics). Using SRS data, a pharmacoinformatic study is able to screen ADE signals under the case‐control design setting, as the reports can be summarized into a two‐by‐two contingency table by drug status (yes/no) and ADE status (yes/no). Pharmacoinformatic studies have successfully identified ADE signals from SRS databases.
Pharmacoinformatic approaches based on two‐by‐two contingency tables (i.e., the case‐control design setting) are also known as disproportionality analysis (DPA), as they measure an ADE signal by the ratio of the observed count of a drug‐ADE pair to its expected count assuming no association. Frequentist DPA approaches include the proportional reporting ratio (PRR) and the reporting odds ratio (ROR). The empirical Bayesian geometric mean (EBGM) is an empirical Bayesian DPA approach, and the information component (IC) is a Bayesian DPA approach. Zhang et al. proposed a three‐component mixture model (3CMM), which provided false discovery rate estimation for DPA signals. In addition to DPA approaches, multivariable approaches, such as multiple logistic regression or regularized logistic regression, have been used to adjust for potential confounding variables (e.g., comedications). All these pharmacoinformatic studies have their own validations, and many promising discoveries have been successfully validated. Pharmacoinformatic studies have also identified ADE signals from longitudinal health data, including electronic health record (EHR) data and administrative claims data. Unlike SRS data, EHR data and administrative claims data contain individual‐level and longitudinal information, including clinical outcomes (e.g., diagnoses) and medications (e.g., pharmacy prescriptions/claims). Compared to SRS data, EHR data and administrative claims data cover more population groups (e.g., individuals without any ADE), more variables (e.g., health conditions other than ADEs), and detailed temporal information. Although the utilization of this additional information can improve ADE screening, pharmacoinformatic studies using longitudinal health data require sophisticated epidemiological study designs, such as the case‐control design.
Under the case‐control design, the aforementioned pharmacoinformatic approaches for SRS data (e.g., DPAs) can be directly applied to longitudinal health data. For instance, Wang et al. identified drug combinations with higher myopathy (i.e., a common muscular ADE) risks from an EHR database using the case‐control design. Although pharmacoinformatic studies have generated valuable ADE signals from longitudinal health data, they should be expanded to investigate: (i) more large‐scale longitudinal health databases, and (ii) much larger numbers of ADEs. First, large‐scale longitudinal health databases have become more common and available. For instance, the MarketScan Commercial Claims and Encounters database contains information on over 40 million patients per year, and it has been cited more than 10,000 times according to Google Scholar. Second, informatics resources allow a high‐throughput pharmacoinformatics study to screen a large number of ADEs. For instance, algorithms to annotate different coding systems allow more than a hundred ADEs to be identified from longitudinal health data. Together, the development of large‐scale longitudinal health data and informatics resources facilitates high‐throughput pharmacoinformatic studies using large‐scale longitudinal health data. However, the control selection process in the case‐control design requires a tremendous amount of time for data preparation. For instance, under the existing incidence density sampling approach, at each ADE case index date (i.e., the health encounter date with an ADE diagnosis), other individuals who have not yet developed the ADE are eligible as controls. Thus, the incidence density sampling approach requires all individuals to be evaluated at all ADE case index dates. Such a process requires a significant amount of computational time.
If the data set contains a few million individuals, our experience shows that the incidence density sampling approach may require approximately 1 week to prepare the case‐control data set for a common ADE using a standard computer. Additionally, using the incidence density sampling approach, the selected controls for one ADE cannot be used for another ADE, as the control selection approach depends on the history of the ADE. Thus, the computational time to prepare case‐control data sets is further increased when investigating a large number of ADEs. For over a hundred ADEs, the projected time to prepare all case‐control data sets is over 1 month on a computer cluster or over 1 year on a standard computer. The extraordinarily high computational time to prepare case‐control data sets is a significant challenge, which cannot be addressed by existing approaches. In this study, we propose a random sampling approach (i.e., random control selection) for the case‐control design, which is computationally efficient for investigating a large number of ADEs using large‐scale longitudinal health data. To reduce the computational time, we propose to select a random control pool by using randomly generated control index dates. The proposed approach is able to significantly reduce the computational time, as it only requires all individuals to be evaluated once. Additionally, the random control pool can be used to prepare multiple case‐control data sets without additional computational time. In contrast, controls for one ADE cannot be used as controls for other ADEs under existing approaches. Thus, the random control selection approach is more computationally efficient for high‐throughput ADE screening. We used two large‐scale longitudinal health data sets and the Observational Medical Outcomes Partnership (OMOP) gold standard to evaluate the performance metrics of the proposed random control selection approach and existing control selection approaches under the case‐control design.

METHODS

Control selection approaches

Because individuals were enrolled on different dates in longitudinal health data, the case‐control design required a baseline period (e.g., 3 or 6 months). The case index date was defined as the date of a health encounter with an ADE diagnosis, given no ADE diagnosis in the baseline period prior to the encounter date. Under the case‐control design, control index dates were selected for each of the case index dates. For instance, the control index date could be the same date as the corresponding case index date (i.e., matching by ADE case index date), as long as the selected individual was eligible as a control at the case index date. Note that an individual must have been enrolled for longer than the baseline period prior to a control index date. Thus, an individual’s eligibility as a control changed over time depending on the individual’s enrollment history. We investigated the drug exposures prior to all case index dates and control index dates. The existing control selection approaches and the proposed random control selection approach are illustrated in Figure 1, and are summarized below:

FIGURE 1

(a) Algorithm for dynamic/super control selection approach. (b) Algorithm for random control selection approach. ADE, adverse drug event

Dynamic (i.e., incidence density sampling) control selection (Figure 1a): for an ADE, individuals eligible as controls at a case index date must have a duration of enrollment longer than the baseline period and must not yet have developed the ADE. Control index dates were matched to the corresponding case index dates. Control index dates had to be generated separately for different ADEs, as the control selection process for a specific ADE depends on the ADE’s history.

Super control selection (Figure 1a): for an ADE, individuals eligible as controls at a case index date must have a duration of enrollment longer than the baseline period and must never have developed the ADE during the entire enrollment period. Control index dates were matched to the corresponding case index dates. Control index dates had to be generated separately for different ADEs, as the control selection process for a specific ADE depends on the ADE’s history.

Random control selection (Figure 1b): randomly selected index dates with a duration of enrollment longer than the baseline period were gathered as a control pool. Control index dates were not matched to the case index dates. Control index dates in the control pool could be used for all ADEs, as the control selection process did not depend on the ADEs’ histories.

In summary, the proposed random control selection approach relaxed two restrictions: (i) matching by case index date, and (ii) matching by ADE history.
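The contrast between the approaches can be illustrated with a minimal Python sketch. It is not the authors' implementation (their sample code is in the Supplementary Data and Codes); the function names, the in-memory dictionaries, and the sampling scheme (individuals drawn uniformly, then a date drawn uniformly within each individual's eligible window) are assumptions made for illustration, as is treating "longer than the baseline period" as at least 180 days.

```python
import random
from datetime import date, timedelta

BASELINE_DAYS = 180  # baseline length used in this study


def eligible_windows(enrollments):
    """Map each person to the date range in which an index date is allowed.

    enrollments: dict of person_id -> (enroll_start, enroll_end) dates.
    People enrolled for less than the baseline period are dropped.
    """
    windows = {}
    for pid, (start, end) in enrollments.items():
        first_ok = start + timedelta(days=BASELINE_DAYS)
        if first_ok <= end:
            windows[pid] = (first_ok, end)
    return windows


def random_control_pool(enrollments, pool_size, rng):
    """Random control selection: one pass over individuals, no ADE history."""
    windows = eligible_windows(enrollments)  # each person is evaluated once
    ids = sorted(windows)
    pool = []
    for _ in range(pool_size):
        pid = rng.choice(ids)  # assumption: individuals drawn uniformly
        first_ok, end = windows[pid]
        offset = rng.randrange((end - first_ok).days + 1)  # uniform date
        pool.append((pid, first_ok + timedelta(days=offset)))
    return pool


def dynamic_eligible(enrollments, first_ade, case_date):
    """Dynamic control selection: re-scan everyone at one case index date.

    first_ade: dict of person_id -> date of first ADE diagnosis, if any.
    """
    eligible = []
    for pid, (start, end) in enrollments.items():
        enrolled = start + timedelta(days=BASELINE_DAYS) <= case_date <= end
        ade_free = pid not in first_ade or first_ade[pid] > case_date
        if enrolled and ade_free:
            eligible.append(pid)
    return eligible


# Toy usage with two hypothetical individuals.
enrollments = {1: (date(2012, 1, 1), date(2012, 12, 31)),
               2: (date(2012, 1, 1), date(2012, 3, 1))}
pool = random_control_pool(enrollments, pool_size=5, rng=random.Random(0))
```

In this sketch `dynamic_eligible` must be called once per case index date and once per ADE, whereas `random_control_pool` is called once and its output can be reused. A super control variant would replace the `ade_free` test with `pid not in first_ade`.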

Data preparation

We used two large‐scale longitudinal health databases in this study. The first database was the MarketScan Commercial Claims and Encounters database from 2012 to 2017. The MarketScan database included ~43 million individuals per year. The second database was the Indiana Network for Patient Care Common Data Model (INPC‐CDM) from 2004 to 2015. The INPC‐CDM database included ~5 million individuals per year. Both databases included individual‐level demographic information (i.e., age and gender), administrative information (e.g., date of hospital visit), diagnoses, procedures, and pharmacy records. The MarketScan database used International Classification of Diseases (ICD)‐9/ICD‐10 codes to record diagnoses and National Drug Codes (NDCs) to record pharmacy claims. The INPC‐CDM database used OMOP concept IDs to record diagnoses and RxNorm to record pharmacy prescriptions. More details of the MarketScan database are presented in the Supplementary Data and Codes. Our outcomes were acute myocardial infarction, acute renal failure, acute liver injury, and gastrointestinal bleeding. These ADEs were identified by using ICD‐9/ICD‐10 codes (algorithm given in Table S1). For each ADE, we defined a case as an ADE diagnosis after a 180‐day baseline period (Figure S1). For instance, the first ADE diagnosis of an individual after 180 days of enrollment was considered a case; a subsequent ADE diagnosis of the same individual was considered a case if all previous ADEs were diagnosed at least 180 days prior to the current diagnosis. From the MarketScan database, we identified 203,797 acute myocardial infarction cases, 295,956 acute renal failure cases, 227,755 acute liver injury cases, and 137,420 gastrointestinal bleeding cases. From the INPC data, we identified 137,439 acute myocardial infarction cases, 165,469 acute renal failure cases, 200,956 acute liver injury cases, and 235,056 gastrointestinal bleeding cases. We conducted case‐control designs for all four ADEs using the two databases.
We used the ADE diagnosis dates as the case index dates. We generated the control index dates by using the dynamic control selection approach, the super control selection approach, and the random control selection approach. As with the cases, individuals eligible as controls must have been enrolled for 180 days prior to the control index dates (Figure S1). In addition, we applied gender and age matching for all three control selection approaches, and we fixed the case‐control ratio at 1:50. Last, we examined the drug exposure statuses within the 30 days prior to the index dates. For both databases, the drug names were normalized to generic drug names. As we did not have the authority to share the original data, we created a mock data set with a structure similar to that of the administrative claims data. Additionally, we provided sample code to prepare case‐control data sets from the mock data set. Please see the Supplementary Data and Codes.
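The case definition above can be sketched in Python as follows. This is an illustrative reading of the rule for one individual and one ADE, not the code from the Supplementary Data and Codes; the function name and arguments are hypothetical, and the use of "at least 180 days" (rather than strictly more) for both the enrollment and the diagnosis gap is an assumption.

```python
from datetime import date

BASELINE_DAYS = 180  # baseline period used in this study


def case_index_dates(enroll_start, diagnosis_dates, baseline_days=BASELINE_DAYS):
    """Return the ADE diagnosis dates that qualify as case index dates.

    A diagnosis qualifies when (1) the individual has been enrolled for at
    least `baseline_days` and (2) every earlier diagnosis of the same ADE
    lies at least `baseline_days` before it. Checking the most recent
    earlier diagnosis is enough, because the dates are processed in order.
    """
    cases = []
    previous = None  # most recent earlier diagnosis, whether or not a case
    for dx in sorted(diagnosis_dates):
        enrolled_long_enough = (dx - enroll_start).days >= baseline_days
        clean_baseline = previous is None or (dx - previous).days >= baseline_days
        if enrolled_long_enough and clean_baseline:
            cases.append(dx)
        previous = dx
    return cases


# Toy usage: four diagnoses for one hypothetical individual.
diagnoses = [date(2012, 3, 1), date(2012, 10, 1),
             date(2012, 11, 1), date(2013, 6, 1)]
cases = case_index_dates(date(2012, 1, 1), diagnoses)
```

In the toy usage, the first diagnosis falls inside the enrollment baseline and the third falls 31 days after the second, so only the second and fourth diagnoses are kept as cases.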

Gold standard and DPA analysis

The OMOP drug‐ADE gold standard was designed to establish a reference set for pharmacovigilance studies. It included 399 drug‐ADE pairs that were based on 181 drugs and the aforementioned four ADEs (acute myocardial infarction, acute renal failure, acute liver injury, and gastrointestinal bleeding). These 399 drug‐ADE pairs were classified as 165 true positive test cases and 234 true negative test cases. For each drug‐ADE pair in the gold standard, the case‐control data set (i.e., all case index dates and control index dates) was summarized into a two‐by‐two contingency table by the statuses of the ADE (Yes/No) and the drug (Yes/No), giving the counts a, b, c, and d in the two‐by‐two table. In this study, we selected two frequentist DPA approaches and one Bayesian DPA approach for evaluation. They were PRR, ROR, and IC (Table 1). Additionally, we computed PRR025, ROR025, and IC025, which were the lower bounds of the 95% confidence intervals for the aforementioned quantities.

TABLE 1

Pharmacoinformatics approaches: PRR, ROR and IC (a, b, c, and d are the four counts in a two‐by‐two contingency table)

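DPA	Formula	Quantity of estimation and description
PRR⁸	$\dfrac{a/(a+b)}{c/(c+d)}$	$\dfrac{P(\mathrm{ADE}\mid\mathrm{Drug})}{P(\mathrm{ADE}\mid\mathrm{NoDrug})}$
ROR⁹	$\dfrac{a/b}{c/d}$	$\dfrac{P(\mathrm{ADE}\mid\mathrm{Drug})/P(\mathrm{NoADE}\mid\mathrm{Drug})}{P(\mathrm{ADE}\mid\mathrm{NoDrug})/P(\mathrm{NoADE}\mid\mathrm{NoDrug})}$
IC¹⁰	$\log_2\dfrac{a+1}{\dfrac{(a+c)(a+b)}{a+b+c+d}+1}$	$\log_2\dfrac{P(\mathrm{ADE}\ \&\ \mathrm{Drug})}{P(\mathrm{ADE})\times P(\mathrm{Drug})}$. By adding 1 to both the numerator and the denominator, infrequent drug‐ADE pairs will have penalized IC values.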

a = count (ADE = Yes and drug = Yes), b = count (ADE = No and drug = Yes), c = count (ADE = Yes and drug = No), and d = count (ADE = No and drug = No).

Abbreviations: IC, information component; PRR, proportional reporting ratio; ROR, reporting odds ratio.
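The three point estimates in Table 1 can be computed directly from the four counts. The following Python sketch is illustrative only: the function names are hypothetical, the example counts are invented for the demonstration, and the confidence‐interval lower bounds (PRR025, ROR025, and IC025) are not covered because their formulas are not given in the text. Zero counts in b, c, or c + d are not handled.

```python
import math


def prr(a, b, c, d):
    """Proportional reporting ratio: P(ADE | drug) / P(ADE | no drug)."""
    return (a / (a + b)) / (c / (c + d))


def ror(a, b, c, d):
    """Reporting odds ratio: odds of ADE with drug / odds of ADE without."""
    return (a / b) / (c / d)


def ic(a, b, c, d):
    """Information component with +1 added to observed and expected counts."""
    expected = (a + c) * (a + b) / (a + b + c + d)
    return math.log2((a + 1) / (expected + 1))


# Invented counts, following the Table 1 convention:
# a = ADE yes & drug yes, b = ADE no & drug yes,
# c = ADE yes & drug no,  d = ADE no & drug no.
a, b, c, d = 20, 80, 100, 800
estimates = {"PRR": prr(a, b, c, d), "ROR": ror(a, b, c, d), "IC": ic(a, b, c, d)}
```

For these counts the expected number of drug‐ADE co‐occurrences is 120 × 100 / 1000 = 12, so the IC is log2(21/13); the PRR is 0.2 / (1/9) = 1.8 and the ROR is 0.25 / 0.125 = 2.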

RESULTS

There were 107,509,200 unique individuals in the MarketScan database. For the four ADEs, there were 864,928 distinct case index dates. We selected 20,000,000 random index dates as the control pool for the random control selection approach. Subsequently, we generated case‐control data sets by using the dynamic control selection approach, the super control selection approach, and the random control selection approach. We computed PRR, PRR025 (i.e., the lower bound of the 95% confidence interval of PRR), ROR, ROR025, IC, and IC025 for all gold standard drug‐ADE pairs. The rules to determine ADE signals for the aforementioned quantities were: PRR greater than 1, PRR025 greater than 1, ROR greater than 1, ROR025 greater than 1, IC greater than 0, and IC025 greater than 0. We computed precision, recall, and F‐score (i.e., performance metrics) using the signal detection rules and the OMOP gold standard. The performance metrics across all drug‐ADE pairs are shown in Figure 2. First, the F‐scores of the random control selection approach were either close to or slightly higher than the F‐scores of the dynamic/super control selection approaches. Specifically, the random control selection approach had F‐scores between 0.586 and 0.600, and the dynamic/super control selection approaches had F‐scores between 0.545 and 0.562. Second, the random control selection approach had higher recall values (0.854–0.961) compared to the dynamic/super control selection approaches (0.720–0.804). Last, all three control selection approaches had similar precision values. Specifically, the random control selection approach had precision values between 0.436 and 0.446, and the dynamic/super control selection approaches had precision values between 0.430 and 0.449. The ADE‐specific performance metrics are shown in Figure S2. In summary, all three control selection approaches had similar F‐scores for acute renal failure, acute liver injury, and gastrointestinal bleeding.
However, the random control selection approach had higher F‐scores for acute myocardial infarction. Additionally, in the INPC data analysis, the performance metrics of the random control selection approach were either close to or higher than those of the dynamic/super control selection approaches (Figures S3 and S4).
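The performance metrics can be sketched in Python as follows. The text does not give the F‐score formula, so this sketch assumes the usual F1 definition (the harmonic mean of precision and recall); the function name and the toy drug‐ADE pairs are hypothetical.

```python
def precision_recall_f(signals, true_positives):
    """Score detected signals against the gold-standard positive pairs.

    signals: drug-ADE pairs flagged by a rule such as PRR > 1.
    true_positives: drug-ADE pairs labeled positive in the gold standard.
    """
    signals, true_positives = set(signals), set(true_positives)
    hits = len(signals & true_positives)
    precision = hits / len(signals) if signals else 0.0
    recall = hits / len(true_positives) if true_positives else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # assumed F1
    return precision, recall, f_score


# Toy usage with hypothetical drug-ADE pairs.
flagged = [("drug1", "AMI"), ("drug2", "AMI"), ("drug3", "AMI"), ("drug4", "AMI")]
gold_positive = [("drug1", "AMI"), ("drug2", "AMI"), ("drug5", "AMI")]
precision, recall, f_score = precision_recall_f(flagged, gold_positive)
```

Under the F1 assumption, a precision of 0.44 and a recall of 0.90 give 2 × 0.44 × 0.90 / 1.34 ≈ 0.59, which is in line with the F‐scores reported for the random control selection approach.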

FIGURE 2

Precision, recall, and F‐score in MarketScan data analysis. IC, information component; PRR, proportional reporting ratio; ROR, reporting odds ratio

Due to the random nature of the control selection process (i.e., the controls were randomly sampled), we replicated the control selection process 50 times to investigate the consistency of control selection. Using acute myocardial infarction as the ADE, we generated 50 case‐control data sets for each of the control selection approaches. Subsequently, we computed precision, recall, and F‐score, and their 95% empirical confidence intervals (i.e., the 2.5% and the 97.5% quantiles of the performance metrics). The results are shown in Figure 3. First, the random control selection approach had higher performance metrics than the dynamic/super control selection approaches. Dynamic control selection and super control selection had similar performance metrics. Second, all three control selection approaches yielded consistent performance metrics with narrow empirical confidence intervals.

FIGURE 3

Performance for 50 independent replications using MarketScan data with acute myocardial infarction as the ADE. ADE, adverse drug event; IC, information component; PRR, proportional reporting ratio; ROR, reporting odds ratio

We also evaluated the actual computation time for all control selection approaches using the MarketScan data on our local server. Because all control selection approaches had the same initialization steps (i.e., determining ADE case index dates and enrollment periods), we only compared the computational time to generate the control index dates. We fixed the case‐control ratio at 1:50. With 107,509,200 individuals, we evaluated the computational time to generate: (i) 500 control index dates for 10 cases; (ii) 5000 control index dates for 100 cases; (iii) 50,000 control index dates for 1000 cases; and (iv) 500,000 control index dates for 10,000 cases. The computational times for 10, 100, 1000, and 10,000 cases were: (i) 0.01, 0.14, 1.41, and 13.94 h for the dynamic control selection approach; (ii) 0.01, 0.12, 1.14, and 10.86 h for the super control selection approach; and (iii) only 0.03, 0.35, 3.21, and 33.32 s for the random control selection approach (Figure 4). The random control selection approach was ~1000 times faster than the dynamic/super control selection approaches.
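The speedup can be checked from the reported timings with a short Python calculation. The variable and function names are illustrative; the numbers are those reported above, with hours converted to seconds.

```python
# Reported times to generate control index dates (MarketScan data).
cases = [10, 100, 1000, 10_000]
dynamic_hours = [0.01, 0.14, 1.41, 13.94]
super_hours = [0.01, 0.12, 1.14, 10.86]
random_seconds = [0.03, 0.35, 3.21, 33.32]


def speedup(hours, seconds):
    """Ratio of an existing approach's time to the random approach's time."""
    return [h * 3600 / s for h, s in zip(hours, seconds)]


dynamic_speedup = speedup(dynamic_hours, random_seconds)
super_speedup = speedup(super_hours, random_seconds)

for n, dyn, sup in zip(cases, dynamic_speedup, super_speedup):
    print(f"{n:>6} cases: dynamic/random = {dyn:.0f}x, super/random = {sup:.0f}x")
```

All eight ratios fall between roughly 1,100 and 1,600, which agrees with the stated order of magnitude of ~1000 times.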

FIGURE 4

Actual and projected computation time for random control selection approach and dynamic/super control selection approach using MarketScan data

DISCUSSION

We propose a random control selection approach to conduct case‐control design‐based high‐throughput ADE screening using large‐scale longitudinal health data. Under the random control selection approach, randomly selected index dates are gathered as a random control pool, which can be used to prepare case‐control data sets for multiple ADEs. Compared with the existing dynamic/super control selection approaches, the random control selection approach relaxes two restrictions: matching by case index date and matching by ADE history. We evaluated the performance metrics of all control selection approaches by using a large‐scale administrative claims data set and a drug‐ADE gold standard. Using precision, recall, and F‐score as metrics, we found that the proposed random control selection approach had performance metrics similar to those of the dynamic/super control selection approaches for three ADEs (acute liver injury, acute renal failure, and gastrointestinal bleeding; Figure 2), and better performance metrics than the dynamic/super control selection approaches for acute myocardial infarction (Figure 3). We replicated the control selection process 50 times, and observed that the random control selection approach had consistent performance metrics (Figure 3). Thus, the reproducibility of the random control selection approach is supported by the 50 replications (Figure 3) and by a separate EHR data analysis (Figures S3 and S4). These results suggest that a case‐control design‐based pharmacoinformatics study using the random control selection approach is comparable to one using the dynamic/super control selection approach for screening ADEs. Our primary motivation for proposing the random control selection approach is to reduce computational time for high‐throughput ADE screening using large‐scale longitudinal health data.
At each case index date, the dynamic/super control selection approaches require all individuals to be evaluated for eligibility, as the control selection approaches depend on ADE/enrollment history. For instance, at a case index date, eligible dynamic controls are individuals who have not yet developed the ADE and have been enrolled for longer than the baseline period (Figure 1). Thus, the total computational time for the dynamic/super control selection approaches is proportional to the total number of case index dates. Given the large number of distinct case index dates in large‐scale longitudinal health data, the dynamic/super control selection approaches require a considerable amount of computational time. Moreover, when investigating multiple ADEs, the controls selected by the dynamic/super control selection approaches for one ADE cannot be used as controls for another ADE, as these approaches depend on the ADE history (i.e., the controls are ADE specific). Thus, the computational time to select controls is further increased for screening multiple ADEs, as multiple case‐control data sets are required. In contrast, the random control selection approach depends on neither the health history nor the ADE history. For random control selection, individuals are randomly selected to form a control pool, and control index dates are randomly selected as well. Thus, once the control pool has been formed, the total computational complexity remains fixed as the number of case index dates increases. Moreover, the random control pool can be used to generate case‐control data sets for screening multiple ADEs without additional computational time. For large‐scale longitudinal health data like the MarketScan data (i.e., N > 40 million per year), the dynamic/super control selection approaches required 100 h to prepare a case‐control data set for an ADE that has a rate of 0.1% (Figure 4). Thus, they required 1000 h to prepare 10 case‐control data sets for 10 ADEs with similar rates.
In contrast, the random control selection approach required only 5 min to generate the random control pool. Subsequently, control index dates were randomly sampled from the control pool. The total time to prepare one case‐control data set was ~1 min. For 10 ADEs, the random control selection approach only required 15 min (i.e., 5 min for preparing the random control pool and 10 min for selecting control index dates for 10 ADEs). We would like to point out that the primary aim of ADE screening is to prioritize true ADE signals (i.e., ADE signal ranking). Thus, bias is not a significant concern for screening ADE signals, as long as the true ADEs can be prioritized. The proposed random control selection approach reduces computational time by relaxing the matching by case index date restriction and the matching by ADE history restriction. In a traditional pharmacoepidemiological study, these two restrictions are used to reduce potential biases. Matching by case index date is able to reduce potential temporal bias. Additionally, matching by ADE history is able to reduce selection bias. We conducted additional simulation studies to evaluate the impact of relaxing the aforementioned restrictions (details given in the Supplementary Simulation Results). Based on 5000 simulations, we observed that the estimated PRR, ROR, and IC values under the random control selection approach were close to the values under the dynamic/super control selection approach (relative differences less than 1%). Thus, relaxing these restrictions may not induce significant biases. In fact, biased control selection approaches have been widely used for practical or scientific reasons. For instance, the super control selection approach is a biased control selection approach. Please note that even a carefully conducted case‐control design is subject to bias. Currently, the active comparator design is considered the gold standard for assessing drug outcomes.
In the active comparator design, the ADE rate among patients exposed to the candidate drug is compared to the ADE rate among patients exposed to the comparator drug (i.e., a drug similar to the candidate drug). However, the active comparator design is highly computationally expensive, as it requires all drug exposures to be assessed. Thus, it is natural to first screen ADE signals by using a computationally efficient approach, and subsequently to validate the ADE signals by using a more rigorous approach. In this study, we observed that: (i) the random control selection approach had similar or better performance metrics compared with the dynamic/super control selection approaches; and (ii) the random control selection approach required much less computational time. For these two reasons, the random control selection approach is able to accelerate the ADE screening process and generate comparable ADE signals. We would like to point out that the proposed random control selection approach can reduce bias by using stratified matching. This can be accomplished by generating separate random control pools for each of the strata. Although the actual computation time for data preparation in the case‐control design may depend on many factors (e.g., hardware and matching process), the proposed random control selection approach should significantly reduce the computation time with stratified matching too. In this study, we selected four ADEs for performance evaluation. They were acute myocardial infarction, acute renal failure, acute liver injury, and gastrointestinal bleeding. We selected these ADEs as they were in the OMOP drug‐ADE gold standard. These ADEs were also highly frequent and had significant clinical consequences (e.g., causing emergency department visits). Thus, these ADEs have been continuously monitored by the US Food and Drug Administration (FDA).
Although we are not primarily focusing on the performance metrics of the DPA methods, we would like to discuss ADE screening with respect to different types of databases and ADEs. First, the performance metrics of a pharmacoinformatics study using different types of longitudinal health databases may be different. In the administrative claims data analysis, the random control selection approach had an F‐score of ~0.6 (precision ~0.45 and recall ~0.90) for all four ADEs. In other words, the random control selection approach was able to select ~90% of the true ADEs. In the EHR data analysis, the random control selection approach had an F‐score of ~0.5 (precision ~0.45 and recall ~0.40) for all four ADEs. Administrative claims data and EHR data inherently have different informatics structures. Thus, an algorithm to identify an ADE may perform differently in these two types of data. Consequently, the performance metrics of pharmacoinformatic studies may differ. Second, we found that the performance metrics differed among the four ADEs. In the administrative claims data analysis, acute liver injury had an F‐score of ~0.8, whereas acute myocardial infarction, acute renal failure, and gastrointestinal bleeding had F‐scores between 0.4 and 0.6. The performance metrics of a case‐control design‐based pharmacoinformatics study depend on the length of the window used to examine drug exposure (i.e., the drug exposure window). In our study, we used a 1‐month drug exposure window for all four ADEs. However, the 1‐month window may not be the best window for all ADEs. For a high‐throughput pharmacoinformatics study of multiple ADEs, an ADE‐specific drug exposure window can be used. The duration of the drug exposure window can be determined by clinical knowledge or sensitivity analysis. Additionally, the performance metrics of a pharmacoinformatic study also depend on confounding control. Longitudinal health data contain a variety of variables, including both categorical and numerical confounders.
Both categorical and numerical confounders can be controlled by using multivariable analysis. Confounders can also be controlled in the case-control design by using stratified case-control matching (e.g., matching by gender or dichotomized age). The performance of a pharmacoinformatic study also depends on the quality of the phenotyping algorithms. ADE phenotyping is currently a fast-growing field, and we expect to see more accurate ADE phenotyping algorithms in the near future.
The scope of this work is to identify a computationally efficient control selection algorithm for screening ADE signals from large-scale longitudinal health databases. Large-scale longitudinal health databases and ADE phenotyping algorithms are becoming increasingly available. Using the random control selection approach, a case-control design-based pharmacoinformatic study can be scaled up to screen a very large number of hypotheses without a correspondingly large time cost, and to identify single drugs and drug combinations with increased ADE risks. The random control selection approach is able to accelerate the subsequent validation studies as well. Ultimately, high-throughput adverse drug event screening using large-scale longitudinal health data will help find better ways to promote health.
One limitation of this study is that the association between drug dosage and ADE was not investigated. Although dosage information is available in the MarketScan and INPC databases, using it requires sophisticated text mining algorithms. Another limitation is that, although 1:4 to 1:10 case-control ratios were used in the current study, the optimal case-control ratio for ADE screening remains unclear for large-scale health data mining.
With large‐scale longitudinal health data, a case‐control design‐based pharmacoinformatic study using randomly selected index dates as controls (i.e., random control selection approach) has similar or higher performance metrics compared with existing control selection approaches. Compared with existing control selection approaches, the random control selection approach is able to significantly reduce the time to prepare the case‐control data sets.

CONFLICT OF INTEREST

The authors declared no competing interests for this work.

AUTHOR CONTRIBUTIONS

All authors wrote the manuscript. L.L., P.Z., and M.D. designed the research. C.C. and P.Z. performed the research. C.C., P.Z., Y.S. and Y.C. analyzed the data.

1. Random control selection for conducting high-throughput adverse drug events screening using large-scale longitudinal health data.

Authors: Chien-Wei Chiang; Penyue Zhang; Macarius Donneyong; You Chen; Yu Su; Lang Li
Journal: CPT Pharmacometrics Syst Pharmacol Date: 2021-08-17
