Cameron Bieganek, Constantin Aliferis, Sisi Ma.
Abstract
Clinical trials represent a critical milestone of translational and clinical sciences. However, poor recruitment to clinical trials has been a long-standing problem affecting institutions all over the world. One way to reduce the cost incurred by insufficient enrollment is to avoid initiating trials that are likely to fall short of their enrollment goal. Hence, the ability to predict, before a trial starts, whether it will meet its enrollment goal is highly beneficial. In the current study, we leveraged a data set extracted from ClinicalTrials.gov that consists of 46,724 U.S.-based clinical trials from 1990 to 2020. We constructed 4,636 candidate predictors from data collected by ClinicalTrials.gov and external sources, and used them to predict enrollment rate with various state-of-the-art machine learning methods. Using a nested time series cross-validation design, our models achieved good predictive performance that is generalizable to future data and stable over time. Moreover, information content analysis revealed study design features to be the most informative feature type regarding enrollment. Models built with study design features alone performed only marginally worse than models built with all features (AUC = 0.76 ± 0.02 vs. AUC = 0.78 ± 0.03). The results presented can form the basis for data-driven decision support systems that assess whether a proposed clinical trial is likely to meet its enrollment goal.
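The abstract does not spell out how the nested time series cross-validation was laid out, so the following is only a minimal sketch of the general idea: outer folds train on earlier years and test on later ones, and each outer training window is split again, in time order, for model selection. The function names, the expanding-window scheme, and the fold counts are illustrative assumptions, not the authors' configuration.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def expanding_window_splits(years: List[int], n_folds: int) -> Iterator[Split]:
    """Yield (train_years, test_years); training always precedes testing in time."""
    ordered = sorted(set(years))
    block = len(ordered) // (n_folds + 1)
    if block < 1:
        raise ValueError("not enough distinct years for the requested folds")
    for k in range(1, n_folds + 1):
        train = ordered[: k * block]
        # The last fold absorbs any leftover years at the end of the range.
        end = (k + 1) * block if k < n_folds else len(ordered)
        yield train, ordered[k * block : end]


def nested_time_series_cv(years: List[int], n_outer: int, n_inner: int):
    """Return a list of (outer_train, outer_test, inner_splits).

    inner_splits re-splits outer_train only, so hyperparameter tuning
    never sees data from the outer test period.
    """
    plan = []
    for outer_train, outer_test in expanding_window_splits(years, n_outer):
        inner = list(expanding_window_splits(outer_train, n_inner))
        plan.append((outer_train, outer_test, inner))
    return plan


# Hypothetical fold counts; only the 1990-2020 range comes from the abstract.
plan = nested_time_series_cv(list(range(1990, 2021)), n_outer=4, n_inner=2)
for outer_train, outer_test, inner in plan:
    print(outer_train[0], outer_train[-1], "->", outer_test[0], outer_test[-1])
```

Splitting by calendar year rather than at random is what lets the reported performance speak to future trials: every evaluation uses a model that has only seen earlier years.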
Year: 2022 PMID: 35202402 PMCID: PMC8870517 DOI: 10.1371/journal.pone.0263193
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Number of U.S.-based clinical trials per year.
Fig 2. Distribution of enrollment rate category over time.
Characteristics of the clinical trials.
Summary statistics of the features in each enrollment rate category are shown: percentages for categorical variables, and medians with interquartile ranges for continuous variables. The “missing” level denotes missing values. Only the five most prevalent MeSH terms are shown.

Clinical trials categorical features, by enrollment rate class

| Study Phase | Low (%) | Medium (%) | High (%) |
|---|---|---|---|
| Early Phase 1 | 1.1 | 0.3 | 0.1 |
| Phase 1 | 9.2 | 4.4 | 4.2 |
| Phase 1/Phase 2 | 3.3 | 0.8 | 0.3 |
| Phase 2 | 11.4 | 3.9 | 3.0 |
| Phase 2/Phase 3 | 0.7 | 0.5 | 0.3 |
| Phase 3 | 1.5 | 1.9 | 3.9 |
| Phase 4 | 3.8 | 2.5 | 1.8 |
| N/A | 17.3 | 14.4 | 9.5 |
| Agency Class | Low (%) | Medium (%) | High (%) |
| Industry | 7.1 | 9.0 | 13.4 |
| NIH | 2.9 | 0.8 | 0.3 |
| Other | 36.7 | 17.8 | 8.8 |
| U.S. Fed | 1.5 | 1.1 | 0.5 |
| Randomized Allocation | Low (%) | Medium (%) | High (%) |
| N/A | 17.8 | 4.0 | 2.5 |
| Non-Randomized | 7.0 | 2.5 | 1.5 |
| Randomized | 22.8 | 22.0 | 19.0 |
| missing | 0.6 | 0.1 | 0.1 |
| Intervention Model | Low (%) | Medium (%) | High (%) |
| Crossover Assignment | 4.5 | 2.8 | 3.2 |
| Factorial Assignment | 0.6 | 0.7 | 0.6 |
| Parallel Assignment | 19.3 | 18.8 | 15.3 |
| Sequential Assignment | 0.5 | 0.3 | 0.2 |
| Single Group Assignment | 22.9 | 5.9 | 3.8 |
| missing | 0.4 | 0.1 | 0.1 |
| Intervention Type | Low (%) | Medium (%) | High (%) |
| Behavioral | 6.0 | 7.2 | 4.6 |
| Biological | 3.5 | 1.2 | 1.2 |
| Combination Product | 0.1 | 0.0 | 0.1 |
| Device | 4.5 | 3.0 | 2.5 |
| Diagnostic Test | 0.1 | 0.1 | 0.1 |
| Dietary Supplement | 1.6 | 1.1 | 0.4 |
| Drug | 25.5 | 11.3 | 11.4 |
| Genetic | 0.1 | 0.0 | 0.0 |
| Other | 4.0 | 3.4 | 2.4 |
| Procedure | 2.4 | 1.2 | 0.5 |
| Radiation | 0.6 | 0.0 | 0.0 |
| Gender Eligibility | Low (%) | Medium (%) | High (%) |
| All | 41.6 | 24.2 | 20.2 |
| Female | 4.4 | 3.3 | 2.0 |
| Male | 2.3 | 1.1 | 0.9 |
| Healthy Volunteers Eligibility | Low (%) | Medium (%) | High (%) |
| Accepts Healthy Volunteers | 9.9 | 11.3 | 11.3 |
| No | 38.3 | 17.4 | 11.8 |
| MeSH Term | Low (%) | Medium (%) | High (%) |
| Diabetes Mellitus | 1.84 | 3.13 | 2.95 |
| Depression | 2.33 | 2.79 | 1.73 |
| Breast Neoplasms | 3.60 | 1.51 | 0.55 |
| Depressive Disorder | 1.96 | 2.15 | 1.50 |
| Syndrome | 2.85 | 1.25 | 0.73 |

Clinical trials continuous features, by enrollment rate class (median, with interquartile range in parentheses)

| Feature | Low | Medium | High |
|---|---|---|---|
| Eligibility Minimum Age | 18 (0) | 18 (0) | 18 (0) |
| Eligibility Maximum Age | 99 (34) | 80 (44) | 80 (44) |
| Facility Count | 1 (1) | 1 (1) | 1 (4) |
| Population (Million) | 4.7 (7.6) | 4.3 (11) | 5.9 (18) |
| Institution Score | 0 (69) | 0 (47) | 0 (17) |
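The two kinds of summary statistic in the table above can be sketched as follows. For any one categorical feature, the cells sum to roughly 100 across all three class columns (for Study Phase, 48.3 + 28.7 + 23.1), which suggests each cell is a share of all trials rather than of one class; the sketch follows that reading. The function names and the choice of the "inclusive" quartile method are assumptions, since the source does not say how the quartiles were computed.

```python
from collections import Counter
from statistics import median, quantiles


def median_iqr(values):
    """Return (median, interquartile range), skipping missing (None) values."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    return median(present), q3 - q1


def joint_percentages(levels, classes):
    """Percentage of all trials in each (feature level, enrollment class) cell.

    A None level is reported under the label "missing".
    """
    cells = Counter(
        ("missing" if lv is None else lv, c) for lv, c in zip(levels, classes)
    )
    total = sum(cells.values())
    return {cell: round(100 * n / total, 1) for cell, n in cells.items()}


# Toy data for illustration only; not taken from the study.
facility_counts = [1, 1, None, 1, 2, 5]
allocation = ["Randomized", "Randomized", "N/A", None]
enrollment = ["Low", "High", "Low", "Low"]

print(median_iqr(facility_counts))
print(joint_percentages(allocation, enrollment))
```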
The feature sets used for information content analysis.
Feature sets (table columns): Complete, Population, Study Center, Study Design, MeSH.
Features (table rows): Study phase; Funding agency class; Randomized allocation; Intervention model; Intervention type; Eligibility gender; Accepts healthy volunteers; Eligibility minimum age; Eligibility maximum age; Study center count; Study population; Institution score; MeSH terms.
[The marks assigning each feature to a feature set are not recoverable from this copy.]
Fig 3. Predictive performance of models constructed with various classification methods using the complete set of 4,636 features.
The classifiers perform similarly, and each outperforms the dummy classifier. Note: on most plots, the curve for the logistic classifier is not visible because its performance coincides with that of the elastic net.
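As an illustration of why a dummy classifier is a useful reference point for AUC, the sketch below computes AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one, and shows that a constant-score baseline lands at 0.5. It assumes a binary outcome and a dummy that always predicts the training prevalence; how the study defined its dummy classifier and handled the three enrollment classes is not stated here, and the data are made up.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison; tied scores count as half a win.

    Quadratic in the number of samples, which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dummy_scores(train_labels, n_test):
    """Hypothetical baseline: every test case gets the training prevalence."""
    prior = sum(train_labels) / len(train_labels)
    return [prior] * n_test


# Toy labels and scores for illustration only.
y_test = [0, 0, 1, 1]
model_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_test, model_scores))                          # 0.75
print(roc_auc(y_test, dummy_scores([0, 1, 1, 0, 0], 4)))      # 0.5
```

Because a constant score cannot rank any positive above any negative, every pair is a tie and the baseline AUC is exactly 0.5, regardless of class balance.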
Fig 4. Predictive performance of random forest models with different feature subsets.
Fig 5. Predictive performance of random forest models with and without domain adaptation on the dataset with reduced MeSH features.