
Quantifying representativeness in randomized clinical trials using machine learning fairness metrics.

Miao Qi¹, Owen Cahan², Morgan A Foreman³, Daniel M Gruen², Amar K Das³, Kristin P Bennett¹,².

Abstract

OBJECTIVE: We help identify subpopulations underrepresented in randomized clinical trials (RCTs) cohorts with respect to national, community-based or health system target populations by formulating population representativeness of RCTs as a machine learning (ML) fairness problem, deriving new representation metrics, and deploying them in easy-to-understand interactive visualization tools.
MATERIALS AND METHODS: We represent RCT cohort enrollment as random binary classification fairness problems, and then show how ML fairness metrics based on enrollment fraction can be efficiently calculated using easily computed rates of subpopulations in RCT cohorts and target populations. We propose standardized versions of these metrics and deploy them in an interactive tool to analyze 3 RCTs with respect to type 2 diabetes and hypertension target populations in the National Health and Nutrition Examination Survey.
RESULTS: We demonstrate how the proposed metrics and associated statistics enable users to rapidly examine representativeness of all subpopulations in the RCT defined by a set of categorical traits (eg, gender, race, ethnicity, smoking status, and blood pressure) with respect to target populations.
DISCUSSION: The normalized metrics provide an intuitive standardized scale for evaluating representation across subgroups, which may have vastly different enrollment fractions and rates in RCT study cohorts. The metrics are beneficial complements to other approaches (eg, enrollment fractions) used to identify generalizability and health equity of RCTs.
CONCLUSION: By quantifying the gaps between RCT and target populations, the proposed methods can support generalizability evaluation of existing RCT cohorts. The interactive visualization tool can be readily applied to identify underrepresented subgroups with respect to any desired source or target populations.


Keywords: health equity; machine learning; population representativeness; randomized clinical trials; subgroup

Year: 2021 PMID: 34568771 PMCID: PMC8460438 DOI: 10.1093/jamiaopen/ooab077

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

BACKGROUND AND SIGNIFICANCE

Inequitable representation and evaluation of diverse subgroups in randomized clinical trials (RCTs) and other clinical research may generate unfair and avoidable differences in population health outcomes. In an analysis of trials conducted by Pfizer between 2011 and 2020, scientists found an urgent need for solutions to enhance diverse representation across all populations within clinical research. Similarly, health inequity attracted great public attention during the COVID-19 pandemic. For example, race and ethnicity are identified factors associated with risk for COVID-19 infection and mortality. Representative enrollment of participants with diverse race and ethnicity is required in clinical trials to ensure valid treatment effect conclusions and to support reliable generalizability of clinical trial results across subpopulations. A well-designed RCT is considered the most reliable way to estimate cause–effect relationships between treatments and outcomes. The randomization that makes RCTs the gold standard of treatment effectiveness comprises 2 processes: the random sampling from source population to trial cohort and the random assignment from trial cohort to different experimental groups. The random sampling is critical to the applicability and generalizability of clinical findings but has received much less attention than random assignment. Figure 1 demonstrates that if a latent patient trait guides the patient enrollment into the study and affects the outcome, then the study generalizability to other reference populations may be limited from a causal inference perspective.

Figure 1.

The causal models of truly randomized clinical trials and biased randomized clinical trials. X represents the subject covariates; Y is the sampling of a subject to a trial; T indicates the treatment; and Z is the outcome. The black arrows represent causal dependencies between variables. A. In the causal model for truly randomized clinical trials, no dependency should exist between X and Y. Thus, the observed probability of outcome Z given the treatment is a good estimate of whether the treatment causes the outcome. B. In the causal model for biased randomized clinical trials, an arrow exists between X and Y′, which indicates the dependence. Thus, invalid causal inferences may be estimated for treatment efficacy among some subpopulations and result in unfair and avoidable population health disparities.

Population representativeness and previous works

We define RCT representativeness as the similarity between an RCT cohort and an investigator-defined target population with the specific goal of understanding the representation differences within subpopulations. The target population for an RCT may be different from the population of all individuals who have a particular health condition. For example, Pradhan et al reported that the level of trial representativeness changes if the target population shifts from patients with type 2 diabetes who are eligible to receive liraglutide to all patients with type 2 diabetes, since the potential subjects become younger and are less likely to have comorbidities. Thus, the first step is to let investigators define the target population based on an appropriate real-world data source, such as an Electronic Health Record (EHR) system, or a nationally representative population sample, such as the National Health and Nutrition Examination Survey (NHANES). Our goal is to calculate representation metrics for all possible subgroups created by the multiple traits and then focus on visualizations and statistical methods that enable users to effectively identify significantly underrepresented subgroups with respect to the target population. Our work complements currently available measurements of trial representativeness. For instance, sGIST, mGIST, GIST, and GIST 2.0 are a series of a priori generalizability methods that can calculate generalizability scores on multiple traits across multiple clinical trials with explicit consideration of eligibility criteria dependencies. These metrics help researchers identify underrepresented subgroups due to eligibility requirements and can thus be used to inform eligibility criteria in trial design. Our current tool focuses on a posteriori evaluation of representativeness, and the analyses presented here deal with simple eligibility criteria such as age over 50 or without diabetes.
Other complex eligibility criteria including trait dependencies are left as future work to incorporate GIST 2.0 into our framework. Our general methodology for calculating and visualizing subgroup representativeness and their statistical significance could also be combined with existing methods for comparing characteristic distributions between study samples and target populations. These include basic metrics such as the difference or ratio of subgroup proportions in RCT cohorts and target populations and propensity score methods. These basic metrics are important indicators of population representativeness. But extending them to handle high-dimensional data or small-size subgroups can be challenging since the misrepresentation may be statistically insignificant and hard to detect in visualizations. Our multi-faceted assessment framework to evaluate diversity, inclusion, and equity provides a comprehensive and interpretable subpopulation-level understanding of population representativeness of RCTs. Our a posteriori metrics have defined a significant difference threshold and equity thresholds supported by well-developed guidelines. We show that sunburst visualizations can explicitly present the influence of different variables over the others thus adding more valuable insights to the approach. By indicating the representativeness of all possible subgroups, our approach could eventually help illuminate the “black box” of sample selection and trial generalizability in clinical trials.

Machine learning fairness and previous works

Machine learning (ML) fairness metrics have been developed to quantify and mitigate bias in ML and artificial intelligence (AI) models. To improve the performance of existing RCT representativeness measurements, we consider sampling to the RCT a random binary classification problem and develop standardized metrics for RCTs based on variations of ML fairness metrics by mapping to the context of RCTs. ML fairness metrics quantify potential bias toward protected groups in trained ML classification model outcomes. Our metrics, instead of comparing positive and negative classes based on model outcomes, focus on the trial-subject data generation process within the RCT. Our novel insight is to regard subject sampling to an RCT as a classification function that is random and then create variants of ML fairness metrics. Our metrics capture how well the actual enrollment of subjects to an RCT cohort matches a truly random sampling. The statistical properties of the hypothetical random sampling from a target population can be estimated using nationally representative datasets or clinical databases of individual characteristics, such as NHANES or from EHRs.

Consolidated standards of reporting trials and previous works

Our main goal is to identify all subgroups that are not well represented in RCT study samples in order to understand generalizability with respect to a target population. Our method augments the Consolidated Standards of Reporting Trials (CONSORT) statement and its extension CONSORT-Equity, which aim to avoid biased results from incomplete or nontransparent research reports that could mislead decision-making in healthcare. By appropriately defining the target population (such as all individuals with the health condition or those clinically defined by eligibility criteria), our metrics and visualization can support incorporating representativeness evaluation before, during, and after any RCT. Additionally, they can help an Institutional Review Board (IRB) or funding agency evaluate the equity in trial-design stages and assist government regulators to ensure a fair distribution of clinical benefits from a study to the general population. Our proposed representativeness metrics are expected to identify subgroups that are insufficiently recruited into and represented in the clinical trial cohort using study summary data only, ensuring privacy, security, and confidentiality of health information. These metrics can then be used by clinicians, clinical researchers, and health policy advocates to assess potential gaps in the applicability of clinical trials in real-world settings.

Our contributions

The contributions discussed in this paper are (1) formulating the problem of representativeness evaluation in RCTs as a comparison between a truly random sampling function in a target population and the actual sampling observed in the clinical trial cohort; (2) deriving new metrics for representativeness of RCT cohorts based on ML fairness metrics; (3) utilizing proposed metrics to measure subject representation of RCT cohorts with respect to a target population; (4) identifying needs, gaps, and barriers of equitable representation of various subgroups in RCT cohorts; (5) designing a tool (an R Shiny App) to automatically evaluate trial representativeness through on-demand subject stratification and distribute reports containing visualizations and explanations for different users.

METHODS AND MATERIALS

We establish a general mapping from RCT to ML fairness and then derive metrics to evaluate the population representation of RCT cohorts based on ML fairness measures. We provide a visual representation of results with associated statistical tests to transparently communicate the quantitative results to diverse user groups. Table 1 provides a glossary of fairness and representativeness terms used throughout the manuscript.

Table 1.

Glossary

Term	Definition	Example(s)
Target population	The group of people that investigators defined to be compared with the RCT cohort	US population with hypertension as defined in NHANES
Subgroup	Subset of target population that share single or multiple common baseline attribute values and thus can be distinguished from the rest	Non-Hispanic black female subjects; non-Hispanic white male subjects
Ideal rate	Proportion of subjects in a subgroup in the target population	Proportion of female subjects among those with hypertension in United States
Observed rate	Proportion of subjects in a subgroup in the RCT	Proportion of female subjects in SPRINT study
Representativeness	The similarity between an RCT sample and its target population distributions
Protected attribute	Attributes that classify the population of a specific disease into groups that have parity in terms of health outcomes received	Age, BMI, total cholesterol
Representativeness metric	Function of disease-specific observed and ideal rates of sampling of protected subgroups to the RCT	Log disparity

Abbreviations: BMI: body mass index; NHANES: National Health and Nutrition Examination Survey; RCT: randomized clinical trial; SPRINT: Systolic Blood Pressure Intervention Trial.


RCT representativeness and ML fairness

In an ML prediction model, given a feature vector $x$ of a subject from distribution $\mathcal{P}$, a binary classifier $D(x)$ predicts if the subject is positive ($D(x)=1$) or negative ($D(x)=0$). The true outcome is $y$. We define RCT representativeness as how well the RCT cohort represents a random sampling of subjects from the specified target distribution. The target distribution can be defined based on analysis goals; for example, eligibility criteria could be considered if appropriate. In RCTs, the feature vector $x$ is the protected attributes or subject traits; the binary classifier $D(x)$ assigns subjects into the study cohort, where $D(x)=1$ means a subject is recruited while $D(x)=0$ means not recruited. $y$ is the true random sampling result of the subject into the study from the target population. For RCT representativeness evaluation, each individual in the target population is defined by $(x, x', y)$, where $x$ represents the protected attributes, $x'$ represents the unprotected attributes, and $y \in \{0, 1\}$ is the ideal sampling of the individual by an RCT. An ideal RCT enrolls subjects i.i.d. from the target population $\mathcal{P}$. The RCT enrollment strategy can be treated as a binary classifier $D$, with $D(x)$ denoting the real observed decision induced by $D$ on an individual $(x, x', y)$. The subgroups are defined via a family of indicator functions $\mathcal{G}$. For each $g \in \mathcal{G}$, $g(x)=1$ means that an individual with protected attributes $x$ is in the subgroup. For this study, we utilize protected attributes of 3 types: demographic characteristics, risk factors, and laboratory results. Here, risk factors are any study-specific covariates defined in the Table 1s of clinical trial publications relevant to the study besides demographic characteristics. The selected variables were both relevant to the study and available in the NHANES data to estimate the target distribution. Any available categorical attribute representation of the target and cohort populations could be used. ML fairness metrics are concerned with guaranteeing similar results across different subgroups.
We assume that the ideal RCT achieves statistical parity, that is, subgroups are independent of outcomes ($P(g(x)=1 \mid y=1) = P(g(x)=1)$). Then we create metrics based on ML fairness measures of statistical parity violations. The proposed metrics also assume that the ideal sampling of a subject to the RCT and the observed sample are independent and that the sizes and the rates of an ideal RCT and the observed trial are the same ($P(y=1) = P(D(x)=1)$). The ideal and observed rates of a subgroup are $\alpha = P(g(x)=1 \mid y=1)$ and $\beta = P(g(x)=1 \mid D(x)=1)$, respectively. The enrollment fraction of a subgroup is $P(D(x)=1 \mid g(x)=1)$. We note that, by the independence assumptions of the ideal RCT, $P(g(x)=1) = \alpha$.
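The relationship between the ideal rate, the observed rate, and the enrollment fraction can be made explicit with Bayes' rule. In the sketch below the symbols are assumed notation chosen for illustration: $\alpha$ is the ideal rate of a subgroup, $\beta$ its observed rate, $\gamma = P(D(x)=1)$ the overall fraction of the target population that is enrolled, and $g(x)=1$ and $D(x)=1$ mark subgroup membership and enrollment. The last step relies on the statistical parity assumption, under which the subgroup's share of the target population equals its ideal rate.

```latex
\begin{align*}
P(D(x)=1 \mid g(x)=1)
  &= \frac{P(g(x)=1 \mid D(x)=1)\, P(D(x)=1)}{P(g(x)=1)} \\
  &= \frac{\beta\,\gamma}{\alpha}
\end{align*}
```

Under these assumptions, a subgroup whose observed rate equals its ideal rate has enrollment fraction $\gamma$, the same as the trial as a whole.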

Log disparity metric for RCT

In ML fairness, the disparate impact measure is the ratio of positive rates of both protected and unprotected groups: $P(D(x)=1 \mid g(x)=1) / P(D(x)=1 \mid g(x)=0)$. Disparate impact adopts the “80 percent rule” suggested by the US Equal Employment Opportunity Commission to decide when the result is unfair: the “80 percent rule” requires the selection rate of a subgroup to be at least 80% of the selection rate of the other subgroups. As shown in the following theorem, when applied to the RCT, disparate impact reduces to an intuitive quantity based on the odds of a protected group in the RCT cohort and in the target population. Theorem (RCT version of Disparate Impact Metric). Based on the ideal RCT assumptions above, the disparate impact metric is equivalent to the ratio of the odds of subjects of the protected group in the observed cohort to the odds of protected subjects in the ideal cohort: $\frac{\beta/(1-\beta)}{\alpha/(1-\alpha)}$, where $\alpha = P(g(x)=1 \mid y=1)$ and $\beta = P(g(x)=1 \mid D(x)=1)$ are the ideal and observed rates of the subgroup. See Supplementary Materials for proof. Since log odds provide advantages for ease of understanding, we propose the following metric for RCT. The Log Disparity metric for measuring how representative subgroup $g(x)$ in the observed trial is as compared to the ideal population is $\text{Log Disparity}(g(x)) = \log\left(\frac{\beta/(1-\beta)}{\alpha/(1-\alpha)}\right)$. In the log disparity metric, a value of 0 indicates perfect clinical equity. A value smaller than the negative of the lower threshold, $-\tau_l$, implies a potential underrepresentation of a subgroup while a value greater than $\tau_l$ implies a potential overrepresentation. We further add an upper threshold, $\tau_u$. A value less than $-\tau_u$ implies high underrepresentation; similarly, a value greater than $\tau_u$ implies high overrepresentation. Values between $-\tau_l$ and $\tau_l$ mean equitable representation. Our metric thresholds are selected based on guidance from literature, but other optimal thresholds under different criteria are allowed as inputs. We use a significance level of 0.05, a lower threshold of −log (0.8), and an upper threshold of −log (0.6).
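The log disparity calculation and its thresholds can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: it assumes the metric is the natural log of the ratio of a subgroup's odds in the cohort to its odds in the target population, the function names are hypothetical, and the handling of values that fall exactly on a threshold is our choice. The significance test that the paper applies before interpreting the metric is not included here.

```python
import math

# Default cut-offs follow the values stated in the text: -log(0.8) and -log(0.6).
# Natural logarithm is an assumption of this sketch.
LOWER = -math.log(0.8)
UPPER = -math.log(0.6)


def log_disparity(observed_rate: float, ideal_rate: float) -> float:
    """Log of the ratio of subgroup odds in the cohort to subgroup odds in the target."""
    if not (0.0 < observed_rate < 1.0 and 0.0 < ideal_rate < 1.0):
        raise ValueError("rates must lie strictly between 0 and 1")
    observed_odds = observed_rate / (1.0 - observed_rate)
    ideal_odds = ideal_rate / (1.0 - ideal_rate)
    return math.log(observed_odds / ideal_odds)


def representation_level(value: float, lower: float = LOWER, upper: float = UPPER) -> str:
    """Map a log disparity value to one of the five levels described in the text."""
    if value < -upper:
        return "highly underrepresented"
    if value < -lower:
        return "underrepresented"
    if value <= lower:  # boundary values are treated as equitable (our choice)
        return "equitable"
    if value <= upper:
        return "overrepresented"
    return "highly overrepresented"
```

Passing `lower` and `upper` as arguments mirrors the statement that other thresholds are allowed as inputs.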

Normalized parity metric

The ML fairness Equal Opportunity metric, which requires subgroups to have the same true positive rates, can also be applied to RCTs. Theorem (RCT version of Equal Opportunity Metric). Let the ideal RCT assumptions hold and $g(x)$ be a binomial random variable; then the ML fairness Equal Opportunity metric has the following equivalent form: $P(D(x)=1)\,\frac{\beta-\alpha}{\alpha(1-\alpha)}$, where $\alpha = P(g(x)=1 \mid y=1)$ and $\beta = P(g(x)=1 \mid D(x)=1)$ are the ideal and observed rates of the subgroup. See Supplementary Materials for proof. The proportion of population in the trial, $P(D(x)=1)$, is extremely small and not very meaningful, thus we propose a new metric. The Normalized Parity metric measures the difference in rates of the protected group in the trial and in the population scaled by the variance of the protected group in the target population. The Normalized Parity metric for measuring how representative subgroup $g(x)$ in the observed trial is as compared to the ideal population is $\text{Normalized Parity}(g(x)) = \frac{\beta-\alpha}{\alpha(1-\alpha)}$. The proposed Log Disparity and Normalized Parity metrics have several nice properties. They are easy to compute. The observed rates of each subgroup, $\beta$, are estimated from trial data. The ideal rates and variance, $\alpha$ and $\alpha(1-\alpha)$, are estimated for the desired target population using surveillance datasets such as NHANES or electronic medical records (EMRs). The required estimates are robust to missing data. Individual privacy can be protected since only summary statistics are required for the proposed metrics, avoiding the pitfalls of alternative metrics requiring per subject calculations. Both metrics have a common interpretation for subgroups with very different background rates: 0 means that demographic parity holds, <0 means the subgroup is underrepresented, and >0 means the subgroup is overrepresented. Statistical tests, which take into account the RCT study size and estimation errors of the ideal assignment rate, quantify the significance of observed disparities for each subgroup. We use a one-proportion two-tailed z-test to determine whether the observed rate deviates significantly from the ideal population rate. We use Benjamini–Hochberg to correct for multiple comparisons across all subgroups.
If the difference between observed and ideal rates is not statistically significant, the subgroup is treated as representative; otherwise, we will use metrics to quantify the subgroup representativeness. Other statistical tests could be used. See Supplementary Material for details. Log disparity and normalized parity are both monotonically increasing functions of the observed rate for a subgroup scaled by the target rate. Log disparity offers some advantages when examining rare subgroups because it is a nonlinear function while normalized parity is a linear function, as discussed in the Supplementary Material (section: Log Disparity vs Normalized Parity). Thus, we focus on log disparity results. All Normalized Parity results are available in the supplement visualization tool.
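The normalized parity metric and the significance screen described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: normalized parity is taken to be the rate difference divided by the Bernoulli variance of the subgroup in the target population, the z-test treats the ideal rate as a fixed value (so it ignores the estimation error of the ideal rate that the paper's test accounts for), and all function names are hypothetical.

```python
import math


def normalized_parity(observed_rate: float, ideal_rate: float) -> float:
    """Rate difference scaled by the Bernoulli variance of the subgroup in the target."""
    return (observed_rate - ideal_rate) / (ideal_rate * (1.0 - ideal_rate))


def z_test_p_value(observed_rate: float, ideal_rate: float, n: int) -> float:
    """Two-tailed one-proportion z-test of the cohort rate against a fixed ideal rate."""
    standard_error = math.sqrt(ideal_rate * (1.0 - ideal_rate) / n)
    z = (observed_rate - ideal_rate) / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def benjamini_hochberg(p_values: list[float], level: float = 0.05) -> list[bool]:
    """Flag, per subgroup, whether its difference is significant after BH correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value is at most (k / m) * level.
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * level:
            cutoff_rank = rank
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant
```

Subgroups that are not flagged would be reported as representative, and the metric value would be interpreted only for the flagged ones.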

RCT trial data

We assess the proposed methodologies on 3 real-world RCTs: Action to Control Cardiovascular Risk in Diabetes (ACCORD), Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT), and Systolic Blood Pressure Intervention Trial (SPRINT) in BioLINCC, with the ideal subgroup assignment rate calculated from individuals with matched disease conditions in NHANES. According to participants’ baseline characteristics typically summarized in Table 1s of clinical trial reports, we selected 9 protected attributes. We categorize continuous variables based on the CDC (Centers for Disease Control and Prevention)-approved standards. Subject data obtained from RCTs are mapped to the existing NHANES categories. The protected attributes examined here are (1) demographic characteristics (gender, race/ethnicity, age, and education); (2) baseline risk factors [smoking status, body mass index, and systolic blood pressure (SBP)]; and (3) baseline laboratory test results [fasting glucose (FG) and total cholesterol (TC)]. The observed rates of the subgroups are calculated from the RCT data. For each study, we construct all possible subgroups that can be instantiated from combinations of protected attribute values. We define 29 univariate, 109 bivariate, and 306 multivariate subgroups based on 9 protected attributes. In general, any baseline subject attributes can be selected as protected attributes in our approach.
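Constructing subgroups from categorical attributes and computing their observed rates can be sketched as below. This is an assumption-laden illustration: it has access to per-subject records, the function name and the toy attribute values are hypothetical, and it lists only subgroups that occur in the cohort. Subgroups absent from the cohort would have to be added from the target population's category list with a rate of 0.

```python
from itertools import combinations

Subgroup = tuple[tuple[str, str], ...]


def observed_subgroup_rates(
    records: list[dict[str, str]], attributes: list[str], max_size: int
) -> dict[Subgroup, float]:
    """Observed rate of every attribute-value combination that occurs in the cohort."""
    counts: dict[Subgroup, int] = {}
    for record in records:
        for size in range(1, max_size + 1):
            for chosen in combinations(attributes, size):
                key = tuple((name, record[name]) for name in chosen)
                counts[key] = counts.get(key, 0) + 1
    total = len(records)
    return {key: count / total for key, count in counts.items()}


# Toy cohort with made-up values, for illustration only.
cohort = [
    {"gender": "Female", "smoking": "Never"},
    {"gender": "Male", "smoking": "Current"},
    {"gender": "Female", "smoking": "Current"},
    {"gender": "Male", "smoking": "Never"},
]
rates = observed_subgroup_rates(cohort, ["gender", "smoking"], max_size=2)
```

Each key is a tuple of (attribute, value) pairs, so univariate and bivariate subgroups share one dictionary.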

Target population

In our experiment, we sought to evaluate how well the studies represented overall diabetic and hypertensive populations in the United States as characterized by NHANES. The ideal rates from target populations are calculated from NHANES 2015–2016 using the R survey package, which accounts for potential bias from complex survey designs. The NHANES population selected varies based on study objectives and desired target population. To evaluate ACCORD, we estimate ideal rates of subgroups of diabetic individuals in the United States using subjects who report having diabetes in NHANES, and we use subjects who report having hypertension in NHANES as the target population to evaluate ALLHAT and SPRINT. These criteria could be modified to consider study inclusion and exclusion criteria depending on the goals of analysis. Since users may have better target population data that match their studies, user-provided target population datasets and multiple target files are allowed. For example, clinicians who focus on their local communities could use the community or health system population as the target to evaluate the equity of RCTs, whereas for researchers who work on a global disease, the target population may be better estimated from global population datasets.
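The idea behind a survey-weighted ideal rate can be illustrated with a weighted proportion. This sketch is not a substitute for the R survey package: it produces only a point estimate from per-respondent weights and does not model strata or sampling units, which a complex-design analysis needs for variance estimates. The function name and inputs are hypothetical.

```python
def weighted_ideal_rate(in_subgroup: list[bool], weights: list[float]) -> float:
    """Survey-weighted proportion of target-population respondents in a subgroup."""
    if len(in_subgroup) != len(weights):
        raise ValueError("one weight is required per respondent")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Sum the weights of respondents who belong to the subgroup.
    subgroup_weight = sum(w for member, w in zip(in_subgroup, weights) if member)
    return subgroup_weight / total_weight
```

With equal weights this reduces to the plain sample proportion, which is the case for a user-provided target dataset that has no survey design.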

RESULTS

To demonstrate the proposed metric, we created a visualization using different colors to represent different representativeness levels in RCTs. For compact presentation, we focus on the log disparity metric. Figure 2 illustrates how the log disparity function applies to the relatively common subgroups Female and Female Non-Hispanic Black in ACCORD.

Figure 2.

The shift of representativeness distribution of Log Disparity metric for different patient subgroups with type 2 diabetes in Action to Control Cardiovascular Risk in Diabetes. The green line corresponds to the ideal rate for the subgroup determined from National Health and Nutrition Examination Survey. The brown line indicates the rate actually observed. A. Log Disparity as function of observed rate for female subgroup. B. Log Disparity as function of observed rate for female non-Hispanic black subgroup.

As shown in Figure 2A, for women with type 2 diabetes, the ideal rate from NHANES is 0.445 while the observed RCT rate is 0.386. The observed female-subject rate falls into the light orange region, which reveals the underrepresentation of female subjects. For Figure 2B, when the subgroup of interest is changed to non-Hispanic black female participants, the ideal rate decreases to 0.079 and the observed rate becomes 0.095. Now the subgroup of interest falls into the teal region, which means that non-Hispanic black female participants are equitably represented in ACCORD. This indicates the influence of the protected attribute race/ethnicity on the representativeness evaluation. By comparing Figure 2A and B, we can observe that metric functions change as the ideal rate changes. The representativeness of 29 univariate subgroups for 3 RCTs is shown in Figures 3 and 4. Dark red represents the subgroups absent from the RCT; light orange and orange indicate that subgroups are underrepresented or highly underrepresented in the RCT relative to the target population; light blue and blue specify the potentially overrepresented or highly overrepresented subgroups; teal shows the subgroup is either equitably represented or has no significant difference; dark gray indicates that no individuals with selected protected attributes exist in the estimated target population; light gray indicates a subgroup absent from both the estimated target population and the RCT.
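The two readings of Figure 2 can be checked by hand. The arithmetic below assumes that log disparity is the natural log of the ratio of the subgroup's odds in the cohort to its odds in the target population, and that the thresholds are $-\log 0.8 \approx 0.223$ and $-\log 0.6 \approx 0.511$; the rates are the ones quoted in the text.

```latex
\begin{align*}
\text{Female:}\quad
  & \log\frac{0.386/0.614}{0.445/0.555}
    \approx \log\frac{0.6287}{0.8018}
    \approx \log 0.784 \approx -0.243 \\
\text{Non-Hispanic black female:}\quad
  & \log\frac{0.095/0.905}{0.079/0.921}
    \approx \log\frac{0.1050}{0.0858}
    \approx \log 1.224 \approx 0.202
\end{align*}
```

Under these assumptions the first value lies between $-0.511$ and $-0.223$, which is the underrepresented band, and the second lies within $\pm 0.223$, which is the equitable band. Both agree with the regions described for the figure.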

Figure 3.

Representativeness of subgroups defined by a single protected attribute using Log Disparity for 3 real-world randomized clinical trials (RCTs). Subgroups are defined by demographic characteristics. Teal cells with a star indicate no statistically significant difference between subgroups from the RCT and target population. Ages are in years. Abbreviations: C: cohort; TP: target population.

Figure 4.

Representativeness of subgroups defined by a single protected attribute using Log Disparity for 3 real-world randomized clinical trials. Subgroups are defined by clinical characteristics. Systolic blood pressure unit = mm Hg; Fasting glucose unit = mmol/L.

We evaluate our ideal estimates for ACCORD, ALLHAT, and SPRINT using prior literature. For example, an estimated probability of female patients among the US hypertensive population in 2015, calculated through Bayes’ formula, is about 47%. Compared with the summary statistics in published literature (ie, about 47% of subjects are women in ALLHAT and 36% of subjects are women in SPRINT), ALLHAT captures the gender distribution among real-world hypertensive participants while SPRINT fails to enroll enough female participants. The color change across categories of an attribute highlights interesting trends in subject representation. Among 3 studies, only 2 attributes achieved equitable representation across all subgroups: gender in ALLHAT and TC in SPRINT. From Figures 3 and 4, we observe that current smokers, young participants, non-Hispanic Asian subjects, and subjects with SBP under 130 mm Hg or FG between 5.6 and 6.9 mmol/L are frequently underrepresented. This indicates that some subgroups in the target population are missing or inadequately represented in the RCTs. The decision-making on a subject, for example, aged 40, based on the SPRINT study would require additional evidence beyond this study.
Also, participants with lower education levels tend to be more underrepresented in SPRINT while participants with higher education levels tend to be more underrepresented in ALLHAT. This suggests that potential social determinant confounders may exist in the RCT. We note that, across all 3 studies, non-Hispanic black participants are overrepresented, perhaps reflecting efforts to ensure minority participation or reflecting study locations. In both hypertension RCTs, Asian subjects may have been insufficiently enrolled. This underrepresentation may also reflect study choices or locations. These trends have to be validated by analysis on more RCTs. For subgroups defined by multiple attributes, sunburst plots better visualize the change of subgroup representation when additional protected attributes are added, as shown in Figure 5. For each type of protected attribute (ie, demographic characteristics, risk factors, and lab results), separate sunburst charts are generated since their matched populations from NHANES are different.
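The nested subgroups behind a sunburst plot can be derived by walking the attributes in ring order. The sketch below is a hypothetical illustration of that data preparation step, not the plotting code of the tool: each path from the inner ring outward is one subgroup, and its value is the cohort rate of that subgroup. The function name and the toy categories are made up.

```python
def sunburst_rings(
    records: list[dict[str, str]], ring_order: list[str]
) -> dict[tuple[str, ...], float]:
    """Cohort rate of each nested subgroup, keyed by its path from the inner ring outward."""
    counts: dict[tuple[str, ...], int] = {}
    for record in records:
        path: tuple[str, ...] = ()
        for attribute in ring_order:
            # Extending the path by one attribute value moves one ring outward.
            path = path + (record[attribute],)
            counts[path] = counts.get(path, 0) + 1
    total = len(records)
    return {path: count / total for path, count in counts.items()}


# Toy cohort with made-up categories, for illustration only.
cohort = [
    {"gender": "Female", "age": "45-64"},
    {"gender": "Female", "age": "65+"},
    {"gender": "Male", "age": "45-64"},
    {"gender": "Male", "age": "45-64"},
]
rings = sunburst_rings(cohort, ["gender", "age"])
```

Changing `ring_order` regroups the same records, which corresponds to reordering the rings of the plot.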

Figure 5.

Representativeness results measured by Log Disparity. A. Color code of representativeness levels. B. Representativeness of Action to Control Cardiovascular Risk in Diabetes randomized clinical trial (RCT) subgroups in sunburst plot with inner to outer rings defined by demographic characteristics gender, age, race/ethnicity, and education level, respectively. C. Representativeness of Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial RCT subgroups in sunburst plot with inner to outer rings defined by risk factors systolic blood pressure, body mass index, and smoking status, respectively. D. Representativeness of Systolic Blood Pressure Intervention Trial RCT subgroups in sunburst plot with inner to outer rings defined by lab results total cholesterol and fasting glucose, respectively.

Figure 5 demonstrates log disparity results for ACCORD on demographic characteristics, ALLHAT on risk factors, and SPRINT on lab results. The interactive sunburst diagram enables users to investigate many subgroups simultaneously to identify missing or underrepresented subgroups in RCTs and NHANES. For example, young female subjects aged under 45 are missing entirely. As shown in Figure 5D, with an additional attribute FG, new subgroups such as participants with glucose ≥7 mmol/L are highly underrepresented for both high and normal TC. This indicates the importance of multivariable subgroup analyses in representativeness. Note that underrepresentativeness may be due to legitimate choices in the study inclusion and exclusion criteria. If desired by the user, absent subgroups in NHANES or any target populations can be estimated using smoothing techniques. The sunburst plots explicitly address diversity, equity, and inclusion of clinical studies with respect to the target population. For instance, Figure 5B identifies the missing evidence in subgroups including any female and non-Hispanic white male subjects aged under 45.
This lack of subject diversity may lead to similar results as shown for the effectiveness of Actemra on COVID-19 patients, in which the study results flipped after including more marginalized participants. Furthermore, our visualization automatically checks if the inclusion and exclusion criteria are met. Based on the criteria of SPRINT, it successfully excluded subjects with SBP under 130 mm Hg but subjects with potential impaired glucose or diabetes still existed based on the lab results.

DISCUSSION

An advantage of the proposed metrics is that they provide a standardized scale for judging trial representativeness for subgroups with vastly different expected rates in the trial; for example, the ideal rates of participation in the type 2 diabetes trial, estimated from NHANES, for subgroups of female subjects, female subjects aged over 64, Hispanic female subjects aged over 64, and Hispanic female subjects aged over 64 with a high school degree are 0.445, 0.172, 0.025, and 0.006, respectively. Evaluating differences between simple rates for many subpopulations would be more challenging. To facilitate visualization of measured performance on clinical trials, we have incorporated a comprehensive set of fairness metrics into our prototype representativeness visualization tool, built with R Shiny, to enable researchers and clinicians to rapidly visualize and assess potential misrepresentation in a given RCT for all possible subgroups. In our application, users can change the number and order of the attributes for the sunburst; for example, instead of Figure 5B, users can visualize representativeness of subgroups for Age with further divisions by Gender and then Race/Ethnicity. With these metrics, users can rapidly determine underrepresentation of subgroups, which can serve as a basis for determining any limitations of the RCT. The metrics and visualizations can potentially help support evaluation of representativeness of existing RCT cohorts, design of new RCTs, and monitoring of enrollment in ongoing RCTs. The visualization may also help healthcare providers quickly understand the applicability of RCT results to a patient in a subgroup. Clinical trials are a key component of health equity. In the context of trial equity, underrepresentation or exclusion of disadvantaged participants may reduce their opportunities to live healthy lives.
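The nested subgroup rates discussed above (for example, female subjects, then female subjects aged over 64) and the user-selectable ring order of the sunburst can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the R Shiny tool's code: the function name `nested_rates`, the record layout, and the toy records are all hypothetical.

```python
from collections import Counter


def nested_rates(records, attribute_order):
    """Return the rate of every nested subgroup, keyed by its path.

    The path ("Female",) is the inner ring and ("Female", "65+") the next
    ring out; each rate is the subgroup's share of all records.
    """
    total = len(records)
    counts = Counter()
    for record in records:
        path = ()
        for attribute in attribute_order:
            path = path + (record[attribute],)
            counts[path] += 1
    return {path: n / total for path, n in counts.items()}


# Toy records for illustration only; these are not NHANES data.
records = [
    {"gender": "Female", "age": "65+"},
    {"gender": "Female", "age": "45-64"},
    {"gender": "Male", "age": "65+"},
    {"gender": "Male", "age": "65+"},
]

by_gender_then_age = nested_rates(records, ["gender", "age"])
by_age_then_gender = nested_rates(records, ["age", "gender"])
print(by_gender_then_age[("Female", "65+")])
print(by_age_then_gender[("65+",)])
```

Changing `attribute_order` only changes how the same individuals are grouped into rings, and rates shrink quickly as more attributes are added. That is why a standardized scale is easier to read than raw rates.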
Our metrics can also be applied to many types of clinical research and representativeness problems by appropriately adjusting the target population statistics based on the population of interest. Besides their use with RCTs, these metrics can be easily modified to assess and visualize health-related disparities, including the distribution of medical care and differences in living and working conditions for patients, provided that matching background information is available to obtain the ideal rate of each subgroup. Furthermore, our approach can be used as a frame of reference to guide clinicians and policy-makers in making decisions based on legitimate reasons and evidence. We offer user selections to dynamically control different conditions, including subgroup characteristics, metric types, and metric cutoffs, under which users make their own decisions. The technical challenges we encountered include determining how to appropriately treat continuous variables such as age and how to account for inclusion and exclusion criteria when mapping between RCT cohorts and the NHANES-based target population. Currently, we discretize all continuous variables; alternative approaches, such as using expected values of numerical variables and other methods applied within the ML framework, are left as future work. It may be desirable to further refine the target populations to adjust for missing and underrepresented subgroups due to RCT inclusion and exclusion criteria. Due to limitations in the types of information gathered in NHANES, we could not apply all eligibility criteria used in the ALLHAT, ACCORD, and SPRINT studies to define the respective clinical populations for our analyses. We plan to validate our metrics by applying them to more trials and comparing the results with other metrics such as GIST 2.0. It may also be useful to create a method combining the proposed metrics with GIST to enable detailed subpopulation analyses of inclusion and exclusion criteria and analysis of multiple trials.
With target populations appropriately defined using eligibility criteria, these approaches can be extended to support equitable single- or multi-site enrollment planning and to monitor the enrollment process, optimizing the representativeness of participants both a priori and throughout the process.
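The discretization of continuous variables such as age, mentioned above, can be sketched in Python as follows. The cut points and labels are hypothetical, chosen only to be consistent with the "aged under 45" and "aged over 64" subgroups named in the text; the bins actually used in the analysis may differ.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a continuous value to a categorical label.

    cut_points must be sorted in ascending order; a value equal to a cut
    point falls into the bin that starts at that cut point.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]


# Hypothetical age bins: under 45, 45 to 64, and over 64.
AGE_CUT_POINTS = [45, 65]
AGE_LABELS = ["<45", "45-64", ">64"]

for age in (44, 45, 64, 65):
    print(age, discretize(age, AGE_CUT_POINTS, AGE_LABELS))
```

Applying the same cut points to both the RCT cohort and the target population keeps the subgroup definitions aligned, which the rate comparison requires.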

CONCLUSION

Quantifying representation is important for scientific rigor and for building true equity into research designs and methods. Health equity is not just a clinical issue; it is a socioeconomic concern with broad consequences. We developed metrics and methods to evaluate how equitably subgroups are represented in RCTs. Unlike most existing studies, which focus on one protected attribute at a time (eg, race) for a single disease (eg, type 2 diabetes), our proposed approach can simultaneously analyze clinical trials designed for several diseases, such as hypertension and type 2 diabetes, and can additionally report representativeness of subgroups defined by multiple attributes, including age and race/ethnicity. Our next steps are to utilize these metrics to monitor existing RCTs, help design new RCTs, and provide tools to disseminate findings to a variety of stakeholders and user groups, including patients, clinicians, data scientists, and policy-makers, who will put the discoveries to use to advance health equity.

ETHICAL APPROVAL

This manuscript was prepared using ACCORD, ALLHAT, and SPRINT Research Materials obtained from the NHLBI Biologic Specimen and Data Repository Information Coordinating Center and does not necessarily reflect the opinions or views of the ACCORD, ALLHAT, SPRINT, or the NHLBI. All methods were carried out following the NHLBI approved research plan: Equity in Clinical Trials, and all procedures were carried out in accordance with the applicable guidelines and regulations from NHLBI Research Materials Distribution Agreement. The procedures were approved by The Rensselaer IRB as IRB Review Not Required. Informed consent was obtained from all subjects by NHLBI. Data from research participants who refused to permit the sharing of their data are deleted from the repository dataset.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

FUNDING

This work was primarily funded by IBM Research AI Horizons Network. All authors were supported by IBM. KPB, MQ, DMG, and OC were also supported by Rensselaer Institute for Data Exploration and Applications. KPB and OC were also supported by United Health Foundation.

AUTHOR CONTRIBUTIONS

KPB, AKD, DMG, and MAF designed and directed the project. MQ, KPB, and OC designed the model and the computational framework. MQ performed the experiments, analyzed the results, built the application, and wrote the manuscript in consultation with KPB, AKD, DMG, and MAF. All authors reviewed the manuscript.

CONFLICT OF INTEREST STATEMENT

None declared.

DATA AVAILABILITY

The example ideal national patient data are calculated from the National Health and Nutrition Examination Survey (NHANES) 2015–2016 conducted by the National Center for Health Statistics (NCHS). The clinical trial data that support the findings of this study are available from Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC) but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are, however, available with permission of BioLINCC. The data generated during and analyzed during the current study are available in the GitHub repository, https://github.com/TheRensselaerIDEA/ClinicalTrialEquity, and via the Dryad Digital Repository at https://doi.org/10.5061/dryad.76hdr7sxf.
