Laura Miron, Rafael S Gonçalves, Mark A Musen.
Abstract
Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata that describes the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontologies, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semi-structured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials.
Year: 2020 PMID: 33339830 PMCID: PMC7749162 DOI: 10.1038/s41597-020-00780-z
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Data entry form in the PRS system. One of several form pages for entering data in the PRS. Red asterisks indicate required fields; red asterisks with a section sign indicate fields required since January 18, 2017. Additional instructions are provided for the ‘study phase’ and ‘masking’ fields, and automated validation messages of levels ‘Note’, ‘Warning’ and ‘Error’ can be seen. A validation rule ensures that the value for ‘number of arms’ is an integer. Another rule checks both the chosen value for ‘study phase’ (Phase 1), and the (lack of) interventions that are enumerated on a separate page of the entry system. However, there is an unexplained inconsistency in the warning levels for missing required elements (missing ‘masking’ generates a ‘Note’, while missing ‘interventional study model’ and ‘number of arms’ both generate ‘Warnings’).
Warning Levels in the Protocol Registration System.
| Type | Explanation |
|---|---|
| Error | Problems that must be addressed (e.g., missing required content, internal inconsistency) |
| Warning | Items that are FDAAA |
| Alert | Problems that need to be addressed |
| Note | |
Significant Fields in ClinicalTrials.gov.
| The name(s) of the disease(s) or condition(s) studied in the clinical study | |
| The intervention(s) associated with each arm or group, most commonly a drug, device, or procedure | |
| A limited list of criteria for selection of participants in the clinical study, provided in terms of inclusion and exclusion criteria | |
| A prespecified measurement used to determine the effect of experimental variables on human subjects in a clinical study | |
| Either a “Sponsor”, “Principal Investigator”, or “Sponsor-Investigator”; the party responsible for submitting information about a trial | |
| Person responsible for the overall scientific leadership of the protocol, including study principal investigator |
Definitions are adapted from the field definitions provided in the ClinicalTrials.gov data dictionary.
Adherence to type expectations for Boolean, integer, date, and age fields.
| Type | Num. fields | Field Names | Format |
|---|---|---|---|
| Boolean | 11 | | ‘Yes\|No’ |
| Integer | 3 | | xs:integer |
| Date | 4 | | ‘(Unknown\|((January\|February\|March\|April\|May\|June\|July\|August\|September\|October\|November\|December) (([12]?[0-9]\|30\|31),)?[12][0-9]{3}))’ |
| Age | 2 | | ‘N/A\|([1-9][0-9]*(Year\|Years\|Month\|Months\|Week\|Weeks\|Day\|Days\|Hour\|Hours\|Minute\|Minutes))’ |
All Boolean, integer, date, and age fields are typed in the XSD, and all values for these fields in all public records are correctly typed. Date XML elements may optionally have an attribute designating them ‘Actual’, ‘Anticipated’, or ‘Estimate’. For age fields, records may represent equivalent ages with different units (e.g., ‘2 Years’ and ‘24 Months’).
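As a rough illustration of the type checks summarized in the table, the following Python sketch tests string values against the listed patterns. It is not the authors' validation code. The patterns are transcribed from the table, with two labeled assumptions: an optional space is allowed after the comma in dates and between the number and the unit in ages (the example values ‘2 Years’ and ‘24 Months’ contain one), and `xs:integer` is approximated by an optionally signed run of digits. The names `PATTERNS` and `is_valid` are hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

PATTERNS = {
    "boolean": re.compile(r"Yes|No"),
    # Assumption: approximates the lexical form of xs:integer.
    "integer": re.compile(r"[+-]?[0-9]+"),
    # Assumption: optional space after the day's comma (", ?").
    "date": re.compile(
        r"(Unknown|((" + MONTHS + r") (([12]?[0-9]|30|31), ?)?[12][0-9]{3}))"
    ),
    # Assumption: optional space between the number and the unit.
    "age": re.compile(
        r"N/A|([1-9][0-9]* ?(Year|Years|Month|Months|Week|Weeks|"
        r"Day|Days|Hour|Hours|Minute|Minutes))"
    ),
}


def is_valid(kind, value):
    """Return True if the whole string matches the pattern for this type."""
    return PATTERNS[kind].fullmatch(value) is not None


if __name__ == "__main__":
    for kind, value in [("date", "January 18, 2017"), ("date", "2017-01-18"),
                        ("age", "24 Months"), ("boolean", "no")]:
        print(kind, repr(value), is_valid(kind, value))
```

Because the check uses `fullmatch`, a value passes only when the entire string conforms. Note that a pattern check of this kind treats ‘2 Years’ and ‘24 Months’ as equally valid and does not detect that they denote the same age.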
Enumerated value fields validation results.
| Field | Valid Value Set (data dictionary) | Records With Rogue Values | Observed Rogue Values | Value Set Defined in XSD? |
|---|---|---|---|---|
| Study Type | Interventional, Observational, Observational [Patient Registry], Expanded Access | 0 | — | |
| Overall Recruitment Status | Not yet recruiting, Recruiting, Enrolling by invitation, “Active, not recruiting”, Completed, Suspended, Terminated, Withdrawn | 0 | — | |
| Responsible Party, by Official Title | Sponsor, Principal Investigator, Sponsor-Investigator | 0 | — | |
| Study Phase | N/A, Early Phase 1, Phase 1, Phase 1/Phase 2, Phase 2/Phase 3, Phase 3, Phase 4 | 0 | — | |
| Intervention Type | Drug, Device, Biologic/Vaccine, Procedure/Surgery, Radiation, Behavioral, Genetic, Dietary Supplement, Combination Product, Diagnostic Test, Other | 0 | — | |
| Sex | All, Male, Female | 0 | — | |
| Sampling Method | Probability Sample, Non-Probability Sample | 0 | — | |
| Overall Study Official’s Role | Study Chair, Study Director, Study Principal Investigator | 0 | — | |
| Individual Site Status | Not yet recruiting, Recruiting, Enrolling by invitation, “Active, not recruiting”, Completed, Suspended, Terminated, Withdrawn | 0 | — | |
| Interventional Study Model | Single Group, Parallel, Crossover, Factorial, Sequential | 0* | Notes: Values appear in XML as “Single Group Assignment”, “Parallel Group Assignment”, etc. | |
| Masking | | 0* | Notes: Values appear in XML as “Double(Participant, Care Provider)”, “Single(Investigator)”, etc. | |
| Primary Purpose | Treatment, Prevention, Diagnostic, Supportive Care, Screening, Health Services Research, Basic Science, Device Feasibility, Other | 196 | Educational/Counseling/Training | |
| Allocation | N/A, Randomized, Nonrandomized | 78 | Random Sample | |
| Arm Type | Experimental, Active Comparator, Placebo Comparator, Sham Comparator, No Intervention, Other | 21 | Case, Control, Treatment Comparison | |
| Observational Study Model | Cohort, Case-Control, Case-Only, Case-Crossover, Ecologic or Community Studies, Family-Based, Other | 5343 | Case Control, Defined Population, Natural History | |
| Time Perspective | Retrospective, Prospective, Cross-sectional, Other | 622 | Longitudinal, Retrospective/Prospective | |
Sixteen fields have an enumerated set of permissible values in the ClinicalTrials.gov data dictionary, but 6 are not typed within the XSD. Four of these 6 contain rogue values. For the interventional study model and masking fields, all values in public records are valid, but the data dictionary does not correctly describe the format of values. The actual format of interventional study model values includes the word ‘assignment’ (e.g., ‘Parallel Group Assignment’ rather than ‘Parallel Group’). Values for masking include the word ‘single/double/triple/quadruple’ in addition to the types of individuals providing masking.
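A check for rogue values of the kind reported in the table can be sketched as a set-membership test. This is a minimal illustration, not the authors' implementation: it assumes each record is a Python dictionary keyed by field name, and the function name `find_rogue_values` and the sample records are hypothetical. The permissible value set is the one listed for Primary Purpose in the table.

```python
from collections import Counter

# Permissible values for Primary Purpose, as listed in the table above.
PRIMARY_PURPOSE = {
    "Treatment", "Prevention", "Diagnostic", "Supportive Care", "Screening",
    "Health Services Research", "Basic Science", "Device Feasibility", "Other",
}


def find_rogue_values(records, field, allowed):
    """Count records whose value for `field` is outside the `allowed` set.

    Returns (number of records with a rogue value, Counter of rogue values).
    Records that lack the field are skipped, not counted as rogue.
    """
    rogue = Counter()
    for record in records:
        value = record.get(field)
        if value is not None and value not in allowed:
            rogue[value] += 1
    return sum(rogue.values()), rogue


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    sample = [
        {"primary_purpose": "Treatment"},
        {"primary_purpose": "Educational/Counseling/Training"},
        {"primary_purpose": "Screening"},
        {},
    ]
    print(find_rogue_values(sample, "primary_purpose", PRIMARY_PURPOSE))
```

Exact string comparison is deliberate here: a value such as ‘Parallel Group Assignment’ would be flagged against a set containing only ‘Parallel’, which is how a mismatch between the data dictionary and the XML format shows up.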
Missing required fields, before and after passage of FDA Final Rule.
| Required Field Name | Number of Interventional Records Missing Field: trials starting before 01/18/17, effective date of Final Rule (n = 192,985) | Number of Interventional Records Missing Field: trials starting on or after 01/18/17 (n = 46,289) | Percentage of Records Missing Field: all interventional trials (n = 239,274) |
|---|---|---|---|
| (B) Official Title | 7,038 | 10 | 2.9% |
| (D) Primary Purpose | 8,253 | 6 | 3.5% |
| (E) Study Design: interventional study model | 6,834 | 0 | 2.9% |
| number of arms | 23,832 | 296 | 10% |
| allocation | 44,552 | 11,938 | 24% |
| masking | 5,423 | 1 | 2.3% |
| arm information | 23,832 | 296 | 10% |
| (L) Intervention Description, for each intervention studied | 45,433 | 29 | 19% |
| (S) Study Start Date | 2,950 | 3 | 1.3% |
| (T) Primary Completion Date | 15,044 | 0 | 6.3% |
| (U) Study Completion Date | 13,516 | 21 | 5.7% |
| (V) Enrollment | 3,966 | 0 | 1.7% |
| (W) Primary Outcome Measure Information: outcome measures | 7,786 | 0 | 3.3% |
| time frame | 12,865 | 0 | 5.4% |
| description | 82,862 | 5,086 | 37% |
| (D) Accepts Healthy Volunteers | 1,323 | 0 | 0.55% |
| (F) Why Study Stopped | 4,385 | 2 | 1.8% |
| (G) Individual Site Status | 0 | 0 | 0% |
| (H) Availability of Expanded Access | 3,158 | 833 | 1.7% |
| (C) Facility Information: no listed facilities | 20,351 | 5,602 | 11% |
| facility name, city, or country | 7,899 | 17 | 3.3% |
| Overall contact info OR contact info for each site | 163,905 | 13,123 | 74% |
For fields required by the FDAAA801 Final Rule, the table lists the percentage of all interventional records (n = 239,274) missing the field, and the percentage of all interventional records with start dates after the effective date of the Final Rule (n = 46,289) missing the field. The marker m indicates that multiple instances of a field are permitted; such a field is considered ‘missing’ if no occurrences of it are listed. The marker c indicates a conditionally required element, such as Why Study Stopped, which is required only if the study terminated before its expected completion date. A conditionally required element is considered missing only if it is both absent and conditionally required for the given record.
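The percentages can be reproduced from the counts in the table. The short Python sketch below does the arithmetic for a few rows; the cohort sizes and counts are taken from the table, while the function name `pct_missing` is hypothetical. The second value it returns, the share of post-rule trials missing a field, is computed here from the counts and is an illustration of the calculation described in the caption.

```python
N_BEFORE = 192_985          # trials starting before 01/18/17
N_AFTER = 46_289            # trials starting on or after 01/18/17
N_ALL = N_BEFORE + N_AFTER  # all interventional trials (239,274)


def pct_missing(missing_before, missing_after):
    """Return (% of all trials missing the field, % of post-rule trials missing it)."""
    overall = 100 * (missing_before + missing_after) / N_ALL
    after = 100 * missing_after / N_AFTER
    return overall, after


if __name__ == "__main__":
    rows = {
        "Official Title": (7_038, 10),
        "allocation": (44_552, 11_938),
        "contact info": (163_905, 13_123),
    }
    for name, (before, after) in rows.items():
        overall_pct, after_pct = pct_missing(before, after)
        print(f"{name}: {overall_pct:.1f}% overall, {after_pct:.2f}% after the Final Rule")
```

Separating the two denominators matters: a field such as Official Title is missing from about 2.9% of all interventional records but from only 10 of the 46,289 trials that started after the Final Rule took effect.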
Fig. 2 Percentage of Interventional Records Missing Required Field Values, by Agency Class of Lead Sponsor. Percentage of ClinicalTrials.gov interventional trial records (n = 239,274) missing values for selected fields required by FDAAA801. Records are categorized by the agency class of the lead sponsor, which is either “NIH”, “U.S. Fed”, “Industry”, or “Other”.
Synonyms Added by ClinicalTrials.gov for Search Term “Cancer”.
| Query Term | Num. Search Results |
|---|---|
| cancer | 74,385 studies |
| Neoplasm | 66,572 studies |
| Tumor | 16,473 studies |
| Malignancy | 3,128 studies |
| Oncology | 1,249 studies |
| Neoplasia | 622 studies |
| neoplastic syndrome | 592 studies |
| Neoplastic Disease | 22 studies |
Behind the scenes, the ClinicalTrials.gov search portal adds 7 synonyms to a user query for “cancer”. Note: Inconsistent capitalization accurately reflects how terms are displayed in ClinicalTrials.gov.
Fig. 3 Percentage of values for the condition field covered by each UMLS Ontology. Each column gives the percentage of the 497,124 values for the condition field contained in ClinicalTrials.gov records that are an exact match for a term from the given ontology. Of 72 ontologies in the UMLS, 29 contained at least one match for a condition value, and 43 contained no matches (omitted from figure). Many condition values have exact matches in more than one ontology. The ontologies that provide the most coverage for condition values are MeSH (62%), MedDRA (46%), and SNOMED-CT (45%).
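The coverage figure described above amounts to an exact-match lookup of each condition value in each ontology's term set. The following Python sketch shows one way such a computation could look. It assumes each ontology is available as an in-memory set of term strings and that matching is case-sensitive string equality; the source does not state either detail. The function name `coverage_by_ontology` and the toy data are hypothetical.

```python
def coverage_by_ontology(values, ontologies):
    """Percentage of values that exactly match a term, per ontology.

    values:     list of condition strings (duplicates are counted each time)
    ontologies: dict mapping ontology name -> set of term strings
    """
    total = len(values)
    if total == 0:
        return {name: 0.0 for name in ontologies}
    return {
        name: 100 * sum(value in terms for value in values) / total
        for name, terms in ontologies.items()
    }


if __name__ == "__main__":
    # Toy data for illustration only; not drawn from ClinicalTrials.gov.
    conditions = ["Asthma", "Breast Cancer", "covid", "Asthma"]
    toy_ontologies = {
        "OntologyA": {"Asthma", "Breast Cancer"},
        "OntologyB": {"Asthma"},
    }
    print(coverage_by_ontology(conditions, toy_ontologies))
```

Because a value can match terms in several ontologies, the per-ontology percentages are computed independently and need not sum to 100.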
Fig. 4 Percentage of values for the intervention field for which we found an exact match in at least one ontology hosted in NCBO BioPortal, grouped by intervention type. Thirty-nine percent of the 557,436 values listed for intervention contained in ClinicalTrials.gov records are an exact match to a term from a BioPortal ontology, without any pre-parsing or normalization, indicating that this field could reasonably support ontology restrictions. Some intervention types are much better represented by ontology terms than others. More than half of all drugs and radiation therapies use ontology terms, but fewer than 15% of listed devices and combination products do.
Number of records with missing and incorrectly formatted eligibility criteria (n = 302,091).
| Correctly Formatted Eligibility Criteria | Correct Headers, but Not Formatted as a Bulleted List | Missing or Malformed Headers | Missing Eligibility Criteria |
|---|---|---|---|
| 183,309 (60.7%) | 73,771 (24.4%) | 44,135 (14.6%) | 876 (0.29%) |
Table shows the count and percentage of ClinicalTrials.gov records with correctly formatted criteria, missing criteria, and the two most common incorrect formats: incorrect ‘inclusion’ and ‘exclusion’ headers, and non-bulleted criteria.
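The four categories in the table can be approximated with a simple text classifier. The sketch below is illustrative only and is not the authors' method. It assumes that a correctly formatted block has an ‘Inclusion Criteria’ header and an ‘Exclusion Criteria’ header, each on its own line, and that every other non-empty line starts with a bullet marker (`-`, `*`, `•`) or a number; the source does not spell out these conventions. The names `classify_criteria`, `HEADER`, and `BULLET` are hypothetical.

```python
import re

# Assumption: headers look like "Inclusion Criteria:" / "Exclusion Criteria:".
HEADER = re.compile(r"^(Inclusion|Exclusion) Criteria:?$", re.IGNORECASE)
# Assumption: a list item starts with -, *, • or a number such as "1." or "1)".
BULLET = re.compile(r"^([-*•]|\d+[.)])\s+\S")


def classify_criteria(text):
    """Classify an eligibility criteria block into one of four categories."""
    if text is None or not text.strip():
        return "missing"
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header_kinds = {line.split()[0].lower() for line in lines if HEADER.match(line)}
    if header_kinds != {"inclusion", "exclusion"}:
        return "bad_headers"
    body = [line for line in lines if not HEADER.match(line)]
    if body and all(BULLET.match(line) for line in body):
        return "correct"
    return "not_bulleted"


if __name__ == "__main__":
    example = (
        "Inclusion Criteria:\n"
        "- Age 18 or older\n"
        "- Diagnosed with asthma\n"
        "\n"
        "Exclusion Criteria:\n"
        "- Pregnant\n"
    )
    print(classify_criteria(example))
```

A classifier like this only checks surface layout. Even a block labeled correct remains free text, which is why a structured eligibility criteria element would still be needed for reliable patient matching.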