Prediction of clinical trial enrollment rates.

Cameron Bieganek^1, Constantin Aliferis^1,2, Sisi Ma^1,2.

Abstract

Clinical trials represent a critical milestone of translational and clinical sciences. However, poor recruitment to clinical trials has been a long-standing problem affecting institutions all over the world. One way to reduce the cost incurred by insufficient enrollment is to avoid initiating trials that are likely to fall short of their enrollment goal. Hence, the ability to predict, before a trial starts, whether it will meet its enrollment goal is highly beneficial. In the current study, we leveraged a data set extracted from ClinicalTrials.gov that consists of 46,724 U.S.-based clinical trials from 1990 to 2020. We constructed 4,636 candidate predictors based on data collected by ClinicalTrials.gov and external sources for enrollment rate prediction using various state-of-the-art machine learning methods. Using a nested time series cross-validation design, our models achieved good predictive performance that is generalizable to future data and stable over time. Moreover, information content analysis revealed the study design related features to be the most informative feature type regarding enrollment. Compared to the performance of models built with all features, the performance of models built with only study design related features is marginally worse (AUC = 0.78 ± 0.03 vs. AUC = 0.76 ± 0.02). The results presented can form the basis for data-driven decision support systems to assess whether proposed clinical trials would likely meet their enrollment goal.

Year: 2022 PMID: 35202402 PMCID: PMC8870517 DOI: 10.1371/journal.pone.0263193

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

Clinical trials represent a critical milestone of translational and clinical science with the most direct impact potential for advancing healthcare related outcomes. Patient recruitment is a necessary condition of success for clinical trials. Under specific situations such as the broad impact of COVID-19, rapid and high volume enrollment for vaccine trials is pivotal to global public health. However, poor recruitment to clinical trials has been a long-standing problem affecting institutions in the US and all over the world. Institute of Medicine reports cited 71% of phase III NCI approved trials and 40% or more of NCI sponsored trials closed without meeting their enrollment goals [1, 2]. This incurs very significant costs and wasted resources for the trial sponsors, scientists conducting the trials, and society at large. Stated differently, it creates numerous dead ends for investigators and the translational health sciences enterprise.

The National Institutes of Health has recognized these problems and has made improvements in clinical trial enrollment a major focus in recent years. Many initiatives, including the Clinical Trials Transformation Initiative and the Trial Innovation Network by the National Center for Advancing Translational Sciences, have been established with improving clinical trial enrollment as one of their primary missions [3, 4]. Many barriers to effective enrollment have been identified, including geographic and socioeconomic access, resource constraints, perception of the patients, and interest and bandwidth of physicians, among other barriers [4-6]. Approaches for increasing clinical trial enrollment have been proposed. For example, utilizing a specialized task force with dedicated resources for trial enrollment can better coordinate resources among multiple simultaneous trials and improve efficiency for enrollment. Establishing centralized clinical trial recruitment management systems with linkage to the electronic health record can facilitate enrollment by identifying potentially eligible patients according to their existing health records. Lastly, recruiting through alternative channels such as social media has also shown promise [7-10].

Improving enrollment for existing trials is critical, but it is only one side of the story. The other key challenge is the cost incurred by launching trials that are unlikely to meet their enrollment goal. The current study aims to minimize this cost by building models to predict the enrollment rate prior to the start of the trial. An accurate enrollment rate prediction model provides valuable information for the cost benefit analysis for launching future clinical trials and is a first step towards a decision support system for clinical trials.

Prior literature on clinical trial enrollment prediction falls into two broad categories. The first category uses various types of parametric models for enrollment rate prediction [11-15]. This class of models estimates the number of patients who would enroll based on an estimated recruitment rate specified by the researcher or the trial enrollment agency. Such models, if successfully built, have in principle the capacity to provide the expected number of patients recruited in a given period of time, as well as confidence intervals around the expected number. Several of the estimation methods also aim to predict recruitment under complex trial settings such as multi-center trials [13, 15]. The main drawback of this class of models is that the accuracy of their predictions relies heavily on the estimated recruitment rate, a piece of information that is generally unavailable prior to the initiation of the trial. Therefore, this class of methods has limited utility for enrollment evaluation prior to the launch of clinical trials.

The second category of studies in the existing literature examined factors associated with enrollment rate [7, 16–20]. Factors that have been previously reported to be associated with enrollment rate include recruitment strategy, trial design, seasonality, type of disease studied, disease severity of the potential participants, and their socioeconomic characteristics. However, the majority of the studies investigating the factors influencing enrollment rate only examined a small number of predictor variables, which may result in sub-optimal models for enrollment prediction. Further, the data used for these studies typically came from narrow trial populations. For example, the data are often obtained from single health systems, or restricted to specific diseases. As a result, these identified factors and their quantitative relationship with enrollment rate are less likely to generalize to the broad spectrum of clinical trials conducted nationally.

The main objective of the current study is to build models for clinical trial enrollment prediction using information available prior to the initiation of the trial. To address the above mentioned gaps in the literature, we specified four sub-objectives and briefly describe the strategies to achieve them:
(1) To build models that generalize over different institutions and organizations and produce robust and stable performance over time: we leveraged a data set consisting of 46,724 U.S.-based clinical trials from 1990 to 2020 extracted from ClinicalTrials.gov, covering a broad trial population.
(2) To examine a larger variety of features for enrollment prediction and improve predictive performance: we constructed a large number of candidate predictors regarding various aspects of individual clinical trials, including characteristics regarding the targeted trial population, the institution responsible for recruitment, the medical domain of the trial, and the design of the trial.
(3) To construct predictive models that do not rely on an estimated enrollment rate and produce good performance: we applied a variety of state-of-the-art supervised learning methods to build predictive models for enrollment rate prediction.
(4) To identify the feature type that is the most informative for enrollment rate prediction: we compared the information content for enrollment rate prediction among models constructed with different feature types.

The manuscript is organized as follows. In the methods section, we first describe the data acquisition and processing procedures. We then introduce the design of our analysis, including the time series cross-validation protocol for model selection and performance estimation, and the comparison of information content in various feature types. We also introduce the metrics used to examine predictive performance. In the results section, we present the performance of various models for predicting enrollment rate. In the discussion section, we highlight the contribution of the study, note the limitations, and point to potential future work.

The main contribution of the current study is that we constructed generalizable models for clinical trial enrollment prediction that have good predictive performance and are stable over time. Our results empirically demonstrated the feasibility of predicting enrollment rate prior to the initiation of clinical trials. Thus, this study forms the basis for data-driven decision support systems to assess whether proposed clinical trials would likely meet their enrollment goal.

Methods

Data

Our primary data source is ClinicalTrials.gov, a Web-based resource with information on publicly and privately supported clinical studies on a wide range of diseases and conditions, maintained by the National Library of Medicine (NLM). The information regarding a specific clinical study is provided and updated by the sponsor or principal investigator of the clinical study and available for download. We downloaded the XML formatted clinical trial records for all studies on the website on October 28, 2020 from https://clinicaltrials.gov/ct2/resources/download. We restricted our dataset to completed, U.S. based, interventional clinical studies (i.e. clinical trials). The ClinicalTrials.gov website contains data on over 350,000 clinical trials conducted around the world over the last two decades. Filtering this data set to completed, U.S. based clinical trials reduced the number of studies to around 56,000. We implemented an additional filtering step in order to only include studies where the enrollment was listed as “Actual” rather than “Estimated” and to only include studies with a duration of at least 5 days. The final data set included 46,724 studies. The total number of trials by year is shown in Fig 1.
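The inclusion filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys are hypothetical names for fields already parsed from the XML records (not the actual XML element names), the three example records are made up, and "U.S. based" is assumed here to mean that at least one study site is in the United States.

```python
from datetime import date

def keep_trial(record, min_duration_days=5):
    """Apply the inclusion filters from the text to one parsed trial record.

    The keys used here are hypothetical field names, not the real
    ClinicalTrials.gov XML element names.
    """
    duration_days = (record["completion_date"] - record["start_date"]).days
    return (
        record["status"] == "Completed"
        and record["study_type"] == "Interventional"
        # Assumption: "U.S. based" means at least one U.S. study site.
        and "United States" in record["countries"]
        and record["enrollment_type"] == "Actual"
        and duration_days >= min_duration_days
    )

# Made-up records for illustration only.
base = {
    "status": "Completed",
    "study_type": "Interventional",
    "countries": ["United States"],
    "enrollment_type": "Actual",
    "start_date": date(2015, 1, 1),
    "completion_date": date(2016, 1, 1),
}
trials = [
    dict(base, id="A"),
    dict(base, id="B", enrollment_type="Estimated"),       # enrollment not final
    dict(base, id="C", completion_date=date(2015, 1, 3)),  # shorter than 5 days
]

kept_ids = [t["id"] for t in trials if keep_trial(t)]
print(kept_ids)
```

Only record "A" passes all filters in this toy example; the other two are dropped for the reasons noted in the comments.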

Fig 1

Number of U.S. based clinical trials per year.

Outcome of interest

The target variable of interest for this study was clinical trial enrollment rate. The rate is defined as the total enrollment divided by the study duration, where the total enrollment and study duration were extracted from the XML records. A box plot of the enrollment rates for each year is shown in S2 Fig in S1 File. For enrollment rate prediction, clinical trial enrollment rate, r (number of participants per year), was categorized into three levels: low, medium, and high. The low category was defined to be r ≤ 25, the medium category was defined to be 25 < r ≤ 100, and the high category was defined to be r > 100. Enrollment rates of 25 and 100 correspond to the 48th and 77th percentiles of the enrollment rate distribution. There is a time-dependent drift in the distribution of class membership, which can be seen in Fig 2. Some of the drift in the more recent years is due to a selection bias, which is discussed in a later section.
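As a small illustration of the outcome definition above, the following sketch computes an enrollment rate and maps it to the three categories. The function names are hypothetical, and converting a duration in days to years with 365.25 days per year is an assumption; the text does not state how the duration was converted.

```python
def enrollment_rate(total_enrollment, duration_days, days_per_year=365.25):
    """Participants per year. The 365.25 days/year conversion is an assumption."""
    return total_enrollment / (duration_days / days_per_year)

def rate_category(r):
    """Map a rate r (participants per year) to the classes defined in the text."""
    if r <= 25:
        return "low"
    if r <= 100:
        return "medium"
    return "high"

# Example: 60 participants enrolled over two years gives 30 per year.
example_rate = enrollment_rate(60, 2 * 365.25)
print(example_rate, rate_category(example_rate))
```

The thresholds are inclusive on the upper end of each class, matching r ≤ 25 for low and 25 < r ≤ 100 for medium.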

Fig 2

Distribution of enrollment rate category over time.

Candidate predictors

We are interested in predicting the enrollment rate of clinical trials prior to the trial start; therefore, we used information available before the initiation of the trials as candidate predictors (i.e. features). All the features (Table 1) listed are derived from the ClinicalTrials.gov XML data, except for two: study population and institution score.

Table 1

Characteristics of the clinical trials.

Clinical trials categorical features
	Enrollment Rate Class
Study Phase	Low (%)	Medium (%)	High (%)
Early Phase 1	1.1	0.3	0.1
Phase 1	9.2	4.4	4.2
Phase 1/Phase 2	3.3	0.8	0.3
Phase 2	11.4	3.9	3.0
Phase 2/Phase 3	0.7	0.5	0.3
Phase 3	1.5	1.9	3.9
Phase 4	3.8	2.5	1.8
N/A	17.3	14.4	9.5
Agency Class	Low (%)	Medium (%)	High (%)
Industry	7.1	9.0	13.4
NIH	2.9	0.8	0.3
Other	36.7	17.8	8.8
U.S. Fed	1.5	1.1	0.5
Randomized Allocation	Low (%)	Medium (%)	High (%)
N/A	17.8	4.0	2.5
Non-Randomized	7.0	2.5	1.5
Randomized	22.8	22.0	19.0
missing	0.6	0.1	0.1
Intervention Model	Low (%)	Medium (%)	High (%)
Crossover Assignment	4.5	2.8	3.2
Factorial Assignment	0.6	0.7	0.6
Parallel Assignment	19.3	18.8	15.3
Sequential Assignment	0.5	0.3	0.2
Single Group Assignment	22.9	5.9	3.8
missing	0.4	0.1	0.1
Intervention Type	Low (%)	Medium (%)	High (%)
Behavioral	6.0	7.2	4.6
Biological	3.5	1.2	1.2
Combination Product	0.1	0.0	0.1
Device	4.5	3.0	2.5
Diagnostic Test	0.1	0.1	0.1
Dietary Supplement	1.6	1.1	0.4
Drug	25.5	11.3	11.4
Genetic	0.1	0.0	0.0
Other	4.0	3.4	2.4
Procedure	2.4	1.2	0.5
Radiation	0.6	0.0	0.0
Gender Eligibility	Low (%)	Medium (%)	High (%)
All	41.6	24.2	20.2
Female	4.4	3.3	2.0
Male	2.3	1.1	0.9
Healthy Volunteers Eligibility	Low (%)	Medium (%)	High (%)
Accepts Healthy Volunteers	9.9	11.3	11.3
No	38.3	17.4	11.8
MeSH Term	Low (%)	Medium (%)	High (%)
Diabetes Mellitus	1.84	3.13	2.95
Depression	2.33	2.79	1.73
Breast Neoplasms	3.60	1.51	0.55
Depressive Disorder	1.96	2.15	1.50
Syndrome	2.85	1.25	0.73
Clinical trials continuous features, median (interquartile range)
	Enrollment Rate Class
	Low	Medium	High
Eligibility Minimum Age	18 (0)	18 (0)	18 (0)
Eligibility Maximum Age	99 (34)	80 (44)	80 (44)
Facility Count	1 (1)	1 (1)	1 (4)
Population (Million)	4.7 (7.6)	4.3 (11)	5.9 (18)
Institution Score	0 (69)	0 (47)	0 (17)

Summary statistics (percentage for categorical variables, median with interquartile range for continuous variables) of the features in each enrollment rate category are shown. The “missing” level represents missing values. Only the 5 most prevalent MeSH terms are shown.

We included the Medical Subject Headings (MeSH) terms as features since they contain information regarding the research topic of the clinical trial. Two different classes of models were investigated: models trained using all 4,624 observed MeSH terms, and models trained on only the most frequently occurring MeSH terms. For the latter case, only MeSH terms that appear in at least 200 clinical trials were included as features. This resulted in the inclusion of 112 MeSH terms. An occurrence count of 200 corresponds to 0.4% of the clinical trials. The most common MeSH term, Diabetes Mellitus, appeared 1151 times, which corresponds to 2.5% of the clinical trials. In total, 53% of the clinical trials in the data set contain at least one of the 112 frequently occurring MeSH terms.

The study population feature represents the size of the population from which participants can be recruited. For a clinical trial with a single study center, the study population is the population of the enclosing metropolitan or micropolitan area. For clinical trials with multiple study centers, the study population is the sum of the populations of all metropolitan and micropolitan areas that are represented by one or more study centers. Statistics on metropolitan and micropolitan area populations were obtained from the U.S. Census Bureau [21] at https://www2.census.gov/programs-surveys/popest/tables/2010-2018/state/totals/PEP_2018_PEPANNRES.zip.

We used the Nature Index scores for the 2018 calendar year to associate an institution score with each clinical trial, to quantify the research capacity and output of the institution responsible for the clinical trial. The data can be obtained from https://www.natureindex.com/annual-tables/2018/institution/all/all/ by selecting “United States of America” as the region/country. The institution associated with a clinical trial was defined to be the institution of the principal investigator for the clinical trial. The Nature Index metrics calculate each institution’s annual contribution of research articles to a curated list of 82 high impact journals. For each article, a share is assigned to each institution involved in writing the article [22]. Shares are assigned to institutions according to the proportion of the authors of an article from each institution; the total share for each article is 1.0. Harvard University had the highest 2018 Nature Index share among U.S. institutions, with a share of 875. The median share among the top 500 U.S. institutions was 7.7. The Nature Index metrics were only available for the top 500 U.S. institutions, so we assigned a share of 0 to institutions that are not included in the top 500.

In theory, each record in the data set should use an institution score that was calculated before the study began, to ensure that information about future institution performance does not affect the learning procedure. However, the annual Nature Index scores were only calculated starting in 2016. We assume that institution score is a mostly time-independent metric, so that the 2018 institution score can be incorporated into the data set without creating a significant bias in the results.
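Two of the feature constructions described above, the frequency threshold on MeSH terms and the summed study population, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the term labels, area names, and population figures in the example are invented.

```python
from collections import Counter

def frequent_mesh_terms(trial_terms, min_trials=200):
    """Return the MeSH terms that appear in at least `min_trials` trials.

    `trial_terms` holds one list of MeSH terms per trial; a term is counted
    at most once per trial.
    """
    counts = Counter(term for terms in trial_terms for term in set(terms))
    return {term for term, n in counts.items() if n >= min_trials}

def study_population(center_areas, area_population):
    """Sum the population over the distinct metropolitan/micropolitan areas
    that contain one or more study centers."""
    return sum(area_population[area] for area in set(center_areas))

# Toy data with invented labels; a threshold of 2 stands in for 200.
toy_trials = [["Term A", "Term B"], ["Term A"], ["Term C"]]
kept_terms = frequent_mesh_terms(toy_trials, min_trials=2)

# Two centers in "Area X" and one in "Area Y": each area is counted once.
toy_population = study_population(
    ["Area X", "Area X", "Area Y"],
    {"Area X": 1_000_000, "Area Y": 250_000},
)
print(kept_terms, toy_population)
```

Counting each area once reflects the wording "represented by one or more study centers"; a trial with several centers in the same area does not multiply that area's population.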

Analytical strategy

Performance estimation and model selection

To evaluate the performance of predicting future enrollment based on historical data and select the best model for the task, we implemented a nested time series cross-validation design [23-25]. Time series cross-validation is a modified version of cross-validation where the observations in the training data occur prior to those in the validation data. Nested cross-validation consists of two levels of cross-validation. The inner cross-validation is used to optimize the model hyper-parameter settings, and the outer cross-validation measures the performance of the hyper-parameter tuned models. In our study, both the inner and outer cross-validations were time series cross-validations. Specifically, we considered 14 validation data sets for the outer cross-validation, each corresponding to one year of data in the period 2006–2019. Data collected from years prior to the validation data were used to train the models and to select the best performing hyper-parameters via a cross-validated hyper-parameter grid search. Since we are interested in predicting low, medium, and high enrollment rates, we considered the following multi-class classification algorithms (with multiple choices for hyper-parameters where applicable, see S1 Table in S1 File) for constructing the predictive models: multinomial logistic regression, k-nearest neighbors (KNN), multinomial elastic net, support vector machine (SVM), and random forest. The best hyper-parameter combination for each classifier was selected on the training data in the inner loop of the nested time series cross-validation. We included a dummy classifier as a reference for the performance metrics that depend on class distribution. The dummy classifier makes predictions according to the prior distribution of the target variable in the training set.
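The fold structure of the nested time series cross-validation can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, folds are defined by calendar year only, and the number of inner folds (`n_inner=3`) is an arbitrary choice because the text does not specify the inner split.

```python
def outer_folds(years, first_val_year, last_val_year):
    """Yield (training_years, validation_year) pairs in which every training
    year is strictly earlier than the validation year."""
    for val_year in range(first_val_year, last_val_year + 1):
        train_years = [y for y in years if y < val_year]
        if train_years:
            yield train_years, val_year

def inner_folds(train_years, n_inner=3):
    """Time-ordered inner folds for hyper-parameter tuning.

    Assumption: each of the last `n_inner` training years serves once as the
    tuning-validation year, trained on all earlier years.
    """
    ordered = sorted(train_years)
    for inner_val in ordered[-n_inner:]:
        inner_train = [y for y in ordered if y < inner_val]
        if inner_train:
            yield inner_train, inner_val

all_years = list(range(1990, 2021))
folds = list(outer_folds(all_years, 2006, 2019))
first_inner = list(inner_folds(folds[0][0]))

print(len(folds))                       # number of outer validation years
print([val for _, val in first_inner])  # tuning years for the 2006 fold
```

In a full pipeline, a grid search would be run over the inner folds of each outer training set, and the selected model would then be scored once on the corresponding outer validation year.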

Variable scaling and missing values

Variable scaling and treatment for missing values were incorporated into the modeling pipeline, so that scaling and imputation on the validation data are done according to the distribution of the training data. This prevents information leak from the validation data into the model training phase. The continuous variables were transformed using a standardizing scaler, except for the study population, which was scaled with a min-max scaler due to the skewed distribution for that variable. The standardizing scaler recenters a variable by subtracting the sample mean of the variable and then rescales the variable by dividing by the sample standard deviation. The min-max scaler rescales a variable to the range [−1, 1]. Median imputation was used for the continuous variables, and missing indicator columns were added to retain the missingness information. For the categorical variables with missing data, we added a “missing” level to the categories for that variable to represent the missingness information.
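The leakage-free preprocessing described above amounts to estimating every statistic on the training data and reusing it unchanged on the validation data. A minimal sketch in plain Python follows; the function names are hypothetical, imputing before computing the mean and standard deviation is an assumed ordering, and the population form of the standard deviation (`pstdev`) is an assumption about the exact convention used.

```python
from statistics import mean, median, pstdev

def fit_standardizer(train_values):
    """Estimate imputation and scaling statistics from training data only."""
    observed = [v for v in train_values if v is not None]
    med = median(observed)
    filled = [med if v is None else v for v in train_values]
    # Assumption: population standard deviation; fall back to 1.0 if constant.
    return {"median": med, "mean": mean(filled), "std": pstdev(filled) or 1.0}

def standardize(values, params):
    """Return (scaled_value, missing_indicator) pairs using training statistics."""
    result = []
    for v in values:
        missing = v is None
        x = params["median"] if missing else v
        result.append(((x - params["mean"]) / params["std"], int(missing)))
    return result

def fit_minmax(train_values):
    """Record the training minimum and maximum (assumed to differ)."""
    return {"lo": min(train_values), "hi": max(train_values)}

def minmax_scale(value, params):
    """Rescale so that the training range maps onto [-1, 1]."""
    return 2 * (value - params["lo"]) / (params["hi"] - params["lo"]) - 1

std_params = fit_standardizer([1.0, None, 3.0])
validation_scaled = standardize([None, 2.0], std_params)

mm_params = fit_minmax([0.0, 10.0])
print(validation_scaled, minmax_scale(5.0, mm_params))
```

Because the minimum and maximum come from the training data, a validation value outside the training range maps outside [-1, 1]; that is expected when no validation information is used for fitting.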

Performance metrics

We used the multi-class area under the receiver operating characteristic curve (AUC), accuracy, recall, and precision as the metrics for evaluating the predictive performance of the models. Let $c_1, \dots, c_K$ be the class labels, where $K$ is the total number of classes. Then the formula for the multi-class AUC [26, 27] is

$$\mathrm{AUC} = \frac{1}{K(K-1)} \sum_{j=1}^{K} \sum_{k \neq j} \mathrm{AUC}(c_j, c_k),$$

where $\mathrm{AUC}(c_j, c_k)$ is the two-class AUC with $c_j$ as the positive class and $c_k$ as the negative class. Note that in the multi-class case, $\mathrm{AUC}(c_j, c_k) \neq \mathrm{AUC}(c_k, c_j)$.

For recall, we use the macro-average recall, which averages the positive identification rate for each class. The macro-average recall is invariant to changes in the class distribution [26]. For any set of predictions from a classifier, we can tabulate a confusion matrix, $C$, such that the $C_{ij}$ element of the confusion matrix is a count of the observations for which the predicted class is $c_i$ and the true class is $c_j$. Let $N_j = \sum_{i=1}^{K} C_{ij}$ be the total number of samples with true class $c_j$. Then the macro-average recall is calculated as

$$\mathrm{recall} = \frac{1}{K} \sum_{j=1}^{K} \frac{C_{jj}}{N_j}.$$

In a similar fashion, the macro-average precision can be defined as

$$\mathrm{precision} = \frac{1}{K} \sum_{i=1}^{K} \frac{C_{ii}}{\hat{N}_i},$$

where $\hat{N}_i = \sum_{j=1}^{K} C_{ij}$ is the total number of observations with predicted class $c_i$. Finally, the multi-class accuracy is defined analogously to the binary accuracy:

$$\mathrm{accuracy} = \frac{1}{N} \sum_{i=1}^{K} C_{ii},$$

where $N$ is the total number of observations. For an in-depth discussion of the properties of these metrics, please see [28, 29].
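The confusion-matrix based metrics described above can be computed directly, as in the following sketch. It follows the convention in the text that rows index the predicted class and columns index the true class; the function names and the example matrix are hypothetical, and the code assumes every class has at least one true and one predicted observation.

```python
def macro_recall(C):
    """Average over classes of (correct predictions / samples with that true class).
    C[i][j] counts observations with predicted class i and true class j."""
    K = len(C)
    return sum(C[j][j] / sum(C[i][j] for i in range(K)) for j in range(K)) / K

def macro_precision(C):
    """Average over classes of (correct predictions / samples predicted as that class)."""
    K = len(C)
    return sum(C[i][i] / sum(C[i]) for i in range(K)) / K

def accuracy(C):
    """Fraction of all observations on the diagonal of the confusion matrix."""
    K = len(C)
    total = sum(sum(row) for row in C)
    return sum(C[i][i] for i in range(K)) / total

# Invented 3-class example: rows = predicted class, columns = true class.
C = [
    [4, 1, 0],
    [2, 3, 1],
    [0, 0, 5],
]
print(macro_recall(C), macro_precision(C), accuracy(C))
```

For this matrix the true-class totals are 6, 4, and 6 and the predicted-class totals are 5, 6, and 5, so recall and precision differ even though both use the same diagonal counts.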

Information content analysis

In order to examine the predictive performance of different types of features in the data set, we trained classifiers on different sets of features and compared the predictive performances to the model trained with all features. We considered the following feature sets: (1) population: the population from which participants can be recruited; (2) study center: population, facility count, and institution score; (3) study design: information related to the design of the clinical trial; (4) MeSH: characteristics regarding the medical domain of the trial; (5) complete: items (1)-(4). The features belonging to each set are shown in Table 2.

Table 2

The feature sets used for information content analysis.

Feature	Complete	Population	Study Center	Study Design	MeSH
Study phase	X			X
Funding agency class	X			X
Randomized allocation	X			X
Intervention model	X			X
Intervention type	X			X
Eligibility gender	X			X
Accepts healthy volunteers	X			X
Eligibility minimum age	X			X
Eligibility maximum age	X			X
Study center count	X		X
Study population	X	X	X
Institution score	X		X
MeSH terms	X				X

Additionally, feature sets including MeSH terms (feature sets (4) and (5)) were instantiated in two ways, either using all 4,624 observed MeSH terms, or using only the most frequently observed MeSH terms. Except where otherwise noted, results below are reported for models trained using all MeSH terms.

Selection bias and domain adaptation

Only completed clinical trials were included in this study, since the total enrollment is not known until a trial is completed. In the more recent years, the number of completed trials shows a decreasing trend (Fig 1). This is due to the fact that many of the trials that were started in recent years have not yet been completed. This introduces a selection bias by including only shorter duration studies in the years close to 2020 (S1, S3 Figs in S1 File) and introduces a shift in the distribution of the enrollment rate (S2 Fig in S1 File) and enrollment rate categories (Fig 2). As a result, a difference in the distribution is present between the training data (earlier in time) and the validation data (more recent in time), especially for the more recent years. The distribution shift may impact model performance, since most machine learning techniques assume that training and validation data are drawn from the same distribution. Therefore, we implemented domain adaptation, a family of techniques designed to alleviate the effects of distribution shift [30-33]. Specifically, we employed an approach where training samples are weighted according to their likelihood in the validation set. Based on the chronological nature of the study duration selection bias in our experiment, we derived the following simple formula for the weights:

$$w_i = I(t_i \leq \tau),$$

where $w_i$ is the weight for training sample $i$, $t_i$ is the study duration for sample $i$, $\tau$ is the maximum possible study duration that can occur in the validation set, and $I$ is the indicator function. The resulting weight for each sample is either 0 or 1. Therefore, applying these weights to the training data is equivalent to filtering out the training samples with duration longer than $\tau$. A mathematical justification for this approach can be found in Supplemental Information on mathematical justification for domain adaptation. We trained models with and without domain adaptation to investigate the impact of the selection bias.
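Because the weights described above are either 0 or 1, the domain adaptation step reduces to a duration filter on the training data. The sketch below illustrates this equivalence; the function names, the `duration_days` key, and the example value of the cutoff `tau` are hypothetical, and how the cutoff is derived for each validation year is left outside the sketch.

```python
def duration_weights(durations, tau):
    """Indicator weights: 1 if the study duration does not exceed tau, else 0."""
    return [1 if t <= tau else 0 for t in durations]

def filter_training(samples, tau):
    """Equivalent formulation: drop training samples with duration longer than tau."""
    return [s for s in samples if s["duration_days"] <= tau]

# Invented training samples; tau = 365 days is an arbitrary illustrative cutoff.
training = [
    {"id": "T1", "duration_days": 100},
    {"id": "T2", "duration_days": 400},
    {"id": "T3", "duration_days": 365},
]
tau = 365

weights = duration_weights([s["duration_days"] for s in training], tau)
kept = [s["id"] for s in filter_training(training, tau)]
print(weights, kept)
```

Passing the weights to a learner that accepts per-sample weights and training on the filtered list give the same fitted model whenever a weight of 0 removes a sample's contribution entirely.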

Results

Characteristics of the ClinicalTrials.gov data

We analyzed 46,724 U.S.-based clinical trials from 1990–2020. The distribution of the enrollment rate categories over time is shown in Fig 2. The class distribution for enrollment rate showed a shift over time, where the proportion of clinical trials with low enrollment rates decreased over time and the proportion of clinical trials with medium and high enrollment rates increased. This is partly due to the fact that we only considered clinical trials completed by October of 2020. Therefore, the more recent clinical trials included in the analysis have shorter study durations. There is an inverse relationship between study duration and enrollment rate (S3 Fig in S1 File), so the bias towards shorter clinical trials results in a bias towards higher enrollment rates. We used information extracted from ClinicalTrials.gov and external sources (population count of the recruitment location and institution score) as features for enrollment rate prediction. The summary statistics of these features are included in Table 1.

Predictive performance of models built with all features

A comparison of the performance on the validation sets (years 2006 to 2019) of the models constructed by various classification methods using all 4,636 candidate predictors is shown in Fig 3. All evaluated classification methods outperformed the dummy classifier, where predictions were made according to the distribution of the target variable in the training data. This indicates that the features, which capture information available prior to the initiation of the trial, contain information regarding the enrollment rate. Furthermore, the predictive performance of the models constructed by the various classification methods was very similar across all the performance metrics. The SVM resulted in the best overall performance over years 2006 to 2019 with AUC = 0.81 ± 0.02 (mean ± std), recall = 0.58 ± 0.02, precision = 0.59 ± 0.04, and accuracy = 0.62 ± 0.05. The random forest model was the second best and performed only minimally worse than the SVM, with AUC = 0.78 ± 0.03 (mean ± std), recall = 0.57 ± 0.02, precision = 0.58 ± 0.04, and accuracy = 0.62 ± 0.05. The performance of the other classifiers is shown in Fig 3 and also listed in S2 Table in S1 File.

Fig 3

Predictive performance of models constructed with various classification methods using the complete set of 4,636 features.

The predictive performance for the various classifiers is similar, and each outperforms the dummy classifier. Note: On most plots, the performance of the logistic classifier is not visible, since its performance is the same as that of the elastic net.

We also investigated the model performance on a reduced variable set where MeSH terms that appeared fewer than 200 times were not included. For the feature set containing these most frequent MeSH terms along with all other features, the SVM classifier did not converge in the allotted run time of 96 hours. Among the methods that finished execution in all experimental settings, the best model was the random forest classifier. The random forest had similar performance compared to when all MeSH terms were used, with average performance over years 2006 to 2019 of AUC = 0.78 ± 0.02 (mean ± std), recall = 0.57 ± 0.01, precision = 0.58 ± 0.03, and accuracy = 0.62 ± 0.04. The performance of the other classifiers is listed in S3 Table in S1 File.

Information content in different types of features

To assess the relative information content in various feature types, we built models using the feature sets listed in Table 2. We report below the results from the random forest classifier (see Fig 4), since SVM, the best overall classifier on the complete feature set, failed to converge on two feature types of interest. The results from all classifiers were similar and can be found in S2 Table in S1 File. The feature set containing study design related information (e.g. phase, funding agency, randomization, intervention, and eligibility criteria) resulted in the best predictive performance among all the feature sets, with an AUC of 0.76 ± 0.02 over all years. All other feature sets examined contained weak to moderate predictive information regarding enrollment rate on their own. The models based on only the population of the recruitment region, the study center characteristics, and the MeSH terms resulted in AUCs of 0.59 ± 0.01, 0.62 ± 0.01, and 0.67 ± 0.02, respectively. Moreover, adding those features to the study design features only provided a marginal increase in model performance: the AUC of the model with the complete set of features was 0.78 ± 0.03, while the AUC of the model with the study design features was 0.76 ± 0.02.

Fig 4

Predictive performance of random forest models with different feature subsets.

Selection bias and domain adaptation

Due to the selection bias present in the more recent years of data (e.g. Fig 2), we employed domain adaptation and compared the predictive performance of models with and without domain adaptation. The effect of the domain adaptation on model performance was negligible when models were trained on the full data set, including all MeSH terms. However, domain adaptation had a larger impact when models were trained on the data set where MeSH terms that appeared fewer than 200 times were not included. This result is demonstrated in Fig 5 for a random forest model, where it is seen that domain adaptation brought the performance for recent years more in line with the performance for the earlier years. The average AUC across 2016 to 2019 was 0.76 ± 0.01 with domain adaptation and 0.73 ± 0.01 without it. The impact of domain adaptation on the predictive performance for the earlier years was minimal, since for those years the distribution shift between the training and validation data was absent or minimal.

Fig 5

Predictive performance of random forest models with and without domain adaptation on dataset with reduced MeSH features.

Discussion

Prediction of clinical trials enrollment rate

In the current study, we quantified the predictive signal for clinical trial enrollment rate in trial characteristics available prior to trial initiation. We adopted a nested time series cross-validation design. This design takes into account the data availability with respect to time for both model selection and performance estimation, reducing the bias in estimating future model performance. We also explored a variety of supervised learning methods for predictive modeling. Our modeling procedure achieved good predictive performance that is stable over time and across different classifiers. This indicates that the modeling procedure produced models that were stable and generalizable to future data. Moreover, using a reduced feature set where MeSH terms that appeared fewer than 200 times were excluded resulted in similar performance. With respect to the information content in various types of features examined, we found that the features regarding study design (including study phase, funding agency, randomization, interventions, and eligibility) were the most informative compared to other feature subsets. The other feature subsets (population, study center, and MeSH terms) contain information regarding enrollment rate, but the information is largely overlapping with the information in the study design related features, since the combined information in all features only marginally improves upon the information in the study design features. Our study revealed that the ClinicalTrials.gov data contain rich information that can be leveraged for predicting enrollment rate. We also showed that external information (population of recruitment location and information regarding the recruiting institution) can be retrieved to supplement data collected by ClinicalTrials.gov.

Limitations and future work

We only examined structured fields from ClinicalTrials.gov as features for predicting enrollment rate. Unstructured free-text data summarizing the goal and procedures of each study is also available from ClinicalTrials.gov. Future work could employ Natural Language Processing methods to extract information and construct features from the free-text data for enrollment rate prediction. Natural Language Processing methods applied to the free-text data from ClinicalTrials.gov have been explored for various applications, including construction of a knowledge base for eligibility criteria [34], illustration of relationships among multiple clinical trials [35], and literature mining for trends and prevalence of clinical instruments [36]. However, using Natural Language Processing methods to construct features for predictive tasks on the ClinicalTrials.gov data is an untapped area. We expect that combining free-text based features with structured data could improve the predictive performance for enrollment rate, as demonstrated in prior studies in other data domains [37-39]. Also, the current study limited the external data sources to the study population obtained from the U.S. Census Bureau [21] and the institutional score defined by the Nature Index score [22]. We believe that exploring additional data sources in future work may further improve predictive performance for enrollment rate. Examples from a variety of problem domains have demonstrated that obtaining information from multiple sources and integrating it with machine learning methods can result in high quality models [40-43]. Potential external sources of information include information related to the principal investigator of the trial ([40] showed several possible feature constructions related to this), the prevalence of the disease or condition in question near the recruitment location ([44] demonstrated a way to construct these variables), and the characteristics of the population near the recruitment location (age, gender, education, insurance status, etc., which can be obtained from census data [21]).
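As a rough illustration of what combining free-text based features with structured data could look like, the sketch below merges simple bag-of-words counts with structured fields into a single feature record. It is a deliberately simplified assumption: the function names, the `text__` prefix, the field names, and the vocabulary are all hypothetical, and a practical system would likely use a more capable Natural Language Processing method.

```python
import re
from collections import Counter


def text_features(summary, vocabulary):
    """Count how often each vocabulary term occurs in a free-text summary."""
    tokens = re.findall(r"[a-z]+", summary.lower())
    counts = Counter(tokens)
    return {f"text__{term}": counts[term] for term in vocabulary}


def combine_features(structured, summary, vocabulary):
    """Merge structured fields and bag-of-words counts into one feature dict."""
    features = dict(structured)  # copy so the caller's dict is not modified
    features.update(text_features(summary, vocabulary))
    return features


# Hypothetical trial record and vocabulary.
structured = {"phase": 3, "randomized": 1}
summary = "A randomized, double-blind trial of drug X. Blind assessment of outcomes."
print(combine_features(structured, summary, ["blind", "placebo"]))
```

Prefixing the text-derived keys keeps them from colliding with structured field names, so the two feature families remain distinguishable in later analyses of information content.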

Conclusions

The current study empirically demonstrated the feasibility of enrollment prediction prior to the initiation of clinical trials. To the best of our knowledge, this is the first study that employs machine learning methods for enrollment rate prediction based on a large and comprehensive sample of U.S.-based clinical trials. This study is the first step towards a data-driven decision support system for assessing whether a proposed clinical trial would likely meet its enrollment goal.

20 Dec 2021

PONE-D-21-37339

Prediction of clinical trial enrollment rates

PLOS ONE

Dear Dr. Ma,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Feb 03 2022 11:59PM.

Kind regards, Sathishkumar V E, Academic Editor, PLOS ONE

Journal Requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

2. We note that the grant information you provided in the 'Funding Information' and 'Financial Disclosure' sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the 'Funding Information' section.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript: "SM's time on this work is partially supported by Grant UL1TR002494". We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "SM's time on this work is partially supported by Grant UL1TR002494."

Reviewers' comments:

1. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #1: Yes. Reviewer #2: Yes.

2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes. Reviewer #2: Yes.

3. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #1: Yes. Reviewer #2: No.

4. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #1: No. Reviewer #2: Yes.

5. Review Comments to the Author

Reviewer #1: The Research Paper needs the following Major Revisions and is Subject for re-review, and after re-review, the final decision for the paper will be done:
a. Abstract- In the last lines, highlight regarding experimental analysis.
b. Introduction should be more broad and should cover more towards Problem Definition and Scope. Add Objectives of the paper at the end of Introduction. Add Organization of the paper.
c. Add some Literature review to this paper with min 4-10 Papers.
d. Add some case study based discussion to the paper.
e. Add conclusion and future scope to the paper.
f. Addition of min 5-10 Latest references of 2021 cum 2022 are recommended to the paper.

Reviewer #2: This article performs the clinical examination on 4,636 candidate predictors based on data collected by ClinicalTrials.gov and external sources for enrollment rate prediction using various state-of-the-art machine learning methods. This article studied well and may be considered for publication after addressing the below:
1. Highlight the contributions in the Introduction.
2. Provide the numbering for subsections of the article.
3. Provide the literature study.
4. The formulae available in the existing papers must be cited.
5. Add the conclusion to the article.
6. List the limitations for the proposed work.
7. How the authors considered missing values in the datasets?

6. Do you want your identity to be public for this peer review? Reviewer #1: Yes: Anand Nayyar. Reviewer #2: No.

10 Jan 2022

The response to review is attached with the revised manuscript (at the end of the combined pdf). Submitted filename: response_to_review.pdf

14 Jan 2022

Prediction of clinical trial enrollment rates

PONE-D-21-37339R1

Dear Dr. Ma,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Kind regards, Sathishkumar V E, Academic Editor, PLOS ONE

Reviewers' comments:

1. Have the authors adequately addressed the comments raised in a previous round of review? Reviewer #1: All comments have been addressed. Reviewer #2: All comments have been addressed.

2. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #1: Yes. Reviewer #2: Yes.

3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes. Reviewer #2: Yes.

4. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #1: Yes. Reviewer #2: Yes.

5. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #1: Yes. Reviewer #2: Yes.

6. Review Comments to the Author

Reviewer #1: The Paper stands Accepted with no further revisions. All the comments are addressed and suitable revisions are done.

Reviewer #2: The authors addressed all the comments and the current version of the article is recommended for the publication.

7. Do you want your identity to be public for this peer review? Reviewer #1: Yes: Anand Nayyar. Reviewer #2: No.

10 Feb 2022

PONE-D-21-37339R1

Prediction of clinical trial enrollment rates

Dear Dr. Ma:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Sathishkumar V E, Academic Editor, PLOS ONE

27 in total

1. Predicting accrual in clinical trials with Bayesian posterior predictive distributions.

Authors: Byron J Gajewski; Stephen D Simon; Susan E Carlson
Journal: Stat Med Date: 2008-06-15 Impact factor: 2.373

2. Improving Clinical Trial Enrollment - In the Covid-19 Era and Beyond.

Authors: Crystal M North; Michael L Dougan; Chana A Sacks
Journal: N Engl J Med Date: 2020-07-15 Impact factor: 91.245

3. Statistical modeling and prediction of clinical trial recruitment.

Authors: Yu Lan; Gong Tang; Daniel F Heitjan
Journal: Stat Med Date: 2018-11-08 Impact factor: 2.373

4. A sense of urgency: Evaluating the link between clinical trial development time and the accrual performance of cancer therapy evaluation program (NCI-CTEP) sponsored studies.

Authors: Steven K Cheng; Mary S Dietrich; David M Dilts
Journal: Clin Cancer Res Date: 2010-11-09 Impact factor: 12.531

5. Natural Language Processing of Clinical Notes for Improved Early Prediction of Septic Shock in the ICU.

Authors: Ran Liu; Joseph L Greenstein; Sridevi V Sarma; Raimond L Winslow
Journal: Conf Proc IEEE Eng Med Biol Soc Date: 2019-07

6. COVID-19 trial graph: a linked graph for COVID-19 clinical trials.

Authors: Jingcheng Du; Qing Wang; Jingqi Wang; Prerana Ramesh; Yang Xiang; Xiaoqian Jiang; Cui Tao
Journal: J Am Med Inform Assoc Date: 2021-08-13 Impact factor: 4.497

7. Predictors of Cardiac Rehabilitation Participation: OPPORTUNITIES TO INCREASE ENROLLMENT.

Authors: Sherrie Khadanga; Patrick D Savage; Diann E Gaalema; Philip A Ades
Journal: J Cardiopulm Rehabil Prev Date: 2021-09-01 Impact factor: 3.646

8. Parental Factors Associated With the Decision to Participate in a Neonatal Clinical Trial.

Authors: Elliott Mark Weiss; Aleksandra E Olszewski; Katherine F Guttmann; Brooke E Magnus; Sijia Li; Anita R Shah; Sandra E Juul; Yvonne W Wu; Kaashif A Ahmad; Ellen Bendel-Stenzel; Natalia A Isaza; Andrea L Lampland; Amit M Mathur; Rakesh Rao; David Riley; David G Russell; Zeynep N I Salih; Carrie B Torr; Joern-Hendrik Weitkamp; Uchenna E Anani; Taeun Chang; Juanita Dudley; John Flibotte; Erin M Havrilla; Charmaine M Kathen; Alexandra C O'Kane; Krystle Perez; Brenda J Stanley; Benjamin S Wilfond; Seema K Shah
Journal: JAMA Netw Open Date: 2021-01-04

9. Methods to improve recruitment to randomised controlled trials: Cochrane systematic review and meta-analysis.

Authors: Shaun Treweek; Pauline Lockhart; Marie Pitkethly; Jonathan A Cook; Monica Kjeldstrøm; Marit Johansen; Taina K Taskila; Frank M Sullivan; Sue Wilson; Catherine Jackson; Ritu Jones; Elizabeth D Mitchell
Journal: BMJ Open Date: 2013-02-07 Impact factor: 2.692

10. Health and Health Determinant Metrics for Cities: A Comparison of County and City-Level Data.

Authors: Ben R Spoer; Justin M Feldman; Miriam L Gofine; Shoshanna E Levine; Allegra R Wilson; Samantha B Breslin; Lorna E Thorpe; Marc N Gourevitch
Journal: Prev Chronic Dis Date: 2020-11-05 Impact factor: 2.830