
Analysis of clinical trial registry entry histories using the novel R package cthist.

Benjamin Gregory Carlisle

Abstract

Historical clinical trial registry data can only be retrieved by manually accessing individual clinical trials through registry websites. This limits the feasibility, accuracy and reproducibility of certain kinds of research on clinical trial activity and presents challenges to the transparency of the enterprise of human research. This paper presents cthist, a novel, free and open source R package that enables automated scraping of clinical trial registry entry histories and returns structured data for analysis. Documentation of the implementation of the package cthist is provided, as well as 3 brief case studies with example code.


Year:  2022        PMID: 35776915      PMCID: PMC9249399          DOI: 10.1371/journal.pone.0270909

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


Introduction

Prospective registration of clinical trials in a public database such as ClinicalTrials.gov or the German Clinical Trials Register (DRKS.de) is ethically required of investigators by the Declaration of Helsinki [1], is a prerequisite for publication according to the ICMJE [2], and for certain clinical trials is mandated by law [3]. A rationale for preregistration of a clinical trial is that it helps to reduce certain biases in the medical literature, such as selective reporting and non-publication bias; however, preregistration plays many other roles within the enterprise of clinical research. Reviewers of clinical trial journal publications can use registry entries to ensure that the published record corresponds to the trial that was planned. Clinical trial registries are also used as tools to help enrol prospective patients, providing them with information about new and ongoing studies by location and disease area. Systematic reviewers make use of clinical trial registries to synthesize available clinical evidence. Meta-researchers analyze data from trial registries to describe programmes of human research, evaluate them, or hold clinical investigators accountable to common standards of good research practice. Clinical trial registry entries are often regarded as an immutable record of a trial's registration; however, they can be modified by the responsible party at any time, and changes to clinical trial registry entries between initial registration and last registration are common [4]. Researchers who analyze clinical trial registry data but fail to account for potential changes to registry entries run the risk of making serious methodological errors, such as failure to account for variable follow-up time to an event of interest [5].
Certain research questions and insights into the enterprise of human research are not possible or not feasible to address without a means of accessing and analyzing the history of changes for an entire cohort of clinical trials in a systematic way. For example, analyzing the rates at which clinical trials achieve their originally anticipated enrolment goals would be very difficult without a method for accessing the original clinical trial registry record and the record that was active at the time of completion. Prior to the publication of this package, options for accessing historical trial registry data were limited. While ClinicalTrials.gov does provide an Application Programming Interface (API) that allows access to the most recent version of clinical trial registry entries, the API does not allow access to historical clinical trial registry data, and DRKS.de does not provide an API at all. There are other third-party options for accessing clinical trial data from DRKS [6]; however, these do not provide historical data either. Hence, the most straightforward way to access historical versions of a trial registry entry was to visit the clinical trial registry website manually and record the trial data systematically in a spreadsheet. While this is often feasible for a single clinical trial's history, applied to a large cohort of clinical trials this method is extremely labour-intensive and error-prone, and it limits reproducibility. The challenge this presents to the feasibility of certain kinds of meta-research has been remarked on elsewhere (see Al-Durra et al. 2020 [7], supplementary appendix L). Alternatively, a researcher could download the entire database of a clinical trial registry at regular intervals; however, this is extremely resource-intensive, both in terms of data storage and the time required to process the data, which also limits the reproducibility of these methods.
In order to make clinical trial registry history research accessible, feasible and reproducible for patients, reviewers, meta-researchers and systematic reviewers, the R package presented here, cthist, was developed. This package provides functions that allow access to historical clinical trial registry data from ClinicalTrials.gov and DRKS.de without introducing the human error that would result from manual searches and extraction, and without the resources that would be required to regularly mass-download and process the entire registry database. In what follows, the implementation of cthist is presented and its use is described through 3 potential case studies with example code.

Methods

Availability and requirements

The R package cthist can be downloaded from CRAN [8], or the development version can be installed from GitHub (https://github.com/bgcarlisle/cthist). Internal package documentation provides arguments and examples for the included functions. The package was written for R [9], and depends on the following R packages: dplyr [10], httr [11], jsonlite [12], magrittr [13], readr [14], rlang [15], rvest [16], selectr [17], stringr [18], tibble [19], polite [20].
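For example, installation typically looks like the following; the `remotes` package used for the GitHub install is one common approach and is an assumption here, not a requirement of cthist:

```r
## Install the stable release from CRAN
install.packages("cthist")

## Or install the development version from GitHub; the `remotes`
## package is assumed to be available for this step
remotes::install_github("bgcarlisle/cthist")

library(cthist)
```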

Package implementation

The cthist package provides functions for downloading historical clinical trial registry data for trials registered on ClinicalTrials.gov and trials registered on DRKS.de; future versions may include functions for downloading historical data from other registries. As much as the constraints of the source databases allow, the output returned by each function for ClinicalTrials.gov is similar to the output returned by its counterpart for DRKS.de. Six functions are provided by cthist: clinicaltrials_gov_dates(), drks_de_dates(), clinicaltrials_gov_version(), drks_de_version(), clinicaltrials_gov_download() and drks_de_download(). Three allow downloading of data from ClinicalTrials.gov, and the other three are their counterparts for DRKS.de. These are described in detail below.

clinicaltrials_gov_dates() and drks_de_dates()

The functions clinicaltrials_gov_dates() and drks_de_dates() take a trial registration number (NCT number or DRKS id, respectively) as an argument and provide a character vector of ISO-8601 formatted dates, one for each update to the registry entry history. For ClinicalTrials.gov, these dates are retrieved by downloading the HTML for the index of history changes, selecting all the cells in the "Submitted Date" column from the table of study record versions, extracting the text using the polite R package, and reformatting the dates. Similarly for DRKS.de, dates are retrieved by downloading the HTML for the change history page, selecting all the cells in the "Date" column from the published versions table, extracting the text using the polite R package, and reformatting the dates. These functions return a character vector of dates on success; on error (e.g. inability to connect to the internet), they return the word "Error" and print an explanatory error message to the R console.
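As a usage sketch (the NCT number below is an arbitrary illustration, and an internet connection is required):

```r
library(cthist)

## Download the dates of all registry entry versions for one trial
## (NCT number chosen arbitrarily for illustration; requires network
## access)
versions <- clinicaltrials_gov_dates("NCT02110043")

## On success, `versions` is a character vector of ISO-8601 dates,
## one per update to the registry entry
print(versions)
```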

clinicaltrials_gov_version()

The function clinicaltrials_gov_version() downloads clinical trial data for the NCT number and version number (starting with 1 as the earliest version) indicated by the function's arguments. The HTML for the historical version page is downloaded using polite and individual data points are extracted using a combination of cascading style sheet (CSS) selectors and regular expressions (regex). A list of 16 data points is returned by this function: overall status, enrolment, start date, start date precision, primary completion date, primary completion date precision, primary completion date type, minimum age, maximum age, sex, gender based, accepts healthy volunteers, inclusion criteria, outcome measures, contacts and sponsors. See Table 1 for a list of the extracted data points, the CSS selectors that identify the HTML elements in which they are found, and the regular expressions used to extract them. For example, the overall status for the version of the trial in question is drawn from cells in the table with CSS id attribute #StudyStatusBody that have text matching the regular expression Overall Status: ([A-Za-z, ]+). Similarly, data points are extracted using CSS and regex as reported in Table 1 for enrolment, start date, and primary completion date.
Table 1

Cascading Style Sheet (CSS) selectors and regular expressions indicating HTML elements and the text to be extracted from them on ClinicalTrials.gov and DRKS.de by clinicaltrials_gov_version() and drks_de_version(), respectively.

| Data point | ClinicalTrials.gov CSS | ClinicalTrials.gov regular expression | DRKS.de CSS | DRKS.de regular expression |
|---|---|---|---|---|
| Overall status | #StudyStatusBody | Overall Status: ([A-Za-z, ]+) | - | - |
| Recruitment status | - | - | li.state | Recruitment Status: ([A-Za-z, -]+) |
| Enrolment | #StudyDesignBody | Enrollment: ([A-Za-z0-9 \\[\\]]+) | li.targetSize | [0-9]+ |
| Enrolment type | - | - | li.targetSize | Planned/Actual: ([A-Za-z]+) |
| Start date | #StudyStatusBody | Study Start: ([A-Za-z0-9, ]+) | li.schedule | [0-9]{4}/[0-9]{2}/[0-9]{2} |
| Primary completion date | #StudyStatusBody | Primary Completion: ([A-Za-z0-9, \\[\\]]+) | - | - |
| Primary completion date type | #StudyStatusBody | (\\[[A-Za-z]+\\]) | - | - |
| Closing date | - | - | li.deadline | [0-9]{4}/[0-9]{2}/[0-9]{2} |
| Minimum age | #EligibilityBody | Minimum Age: ([0-9]+) Years | li.minAge | Minimum Age: ([A-Za-z0-9 ]+) |
| Maximum age | #EligibilityBody | Maximum Age: ([0-9]+) Years | li.maxAge | Maximum Age: ([A-Za-z0-9 ]+) |
| Sex | #EligibilityBody | Sex: ([A-Za-z]+) | - | - |
| Gender | - | - | li.gender | Gender: ([A-Za-z ]+) |
| Gender based | #EligibilityBody | Gender Based: ([A-Za-z]+) | - | - |
| Accepts healthy volunteers | #EligibilityBody | Accepts Healthy Volunteers: ([A-Za-z]+) | - | - |
| Inclusion criteria | #EligibilityBody | ** | - | - |
| Additional inclusion criteria | - | - | .inclusionAdd | ** |
| Exclusion criteria | - | - | .exclusion | ** |
| Primary outcomes | - | - | p.primaryEndpoint | ** |
| Secondary outcomes | - | - | p.secondaryEndpoints | ** |
| Outcome measures | #ProtocolOutcomeMeasuresBody | ** | - | - |
| Contacts | #ContactsLocationsBody | ** | ul.addresses li.address | ** |
| Sponsors | #SponsorCollaboratorsBody | ** | - | - |

Asterisks (**) indicate table data parsed and encoded as JSON rather than extracted using simple regular expressions. A single hyphen (-) indicates that this data point is not available to be downloaded for this clinical trial registry.
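The extraction technique summarized in Table 1 can be illustrated with a minimal, self-contained sketch; the HTML fragment below is invented for illustration and is far simpler than a real ClinicalTrials.gov history page:

```r
library(rvest)
library(stringr)

## A toy HTML fragment imitating the structure of the status table
## (invented for illustration)
html <- minimal_html('
  <table id="StudyStatusBody">
    <tr><td>Overall Status: Recruiting</td></tr>
    <tr><td>Study Start: September 1, 2009</td></tr>
  </table>')

## Select cells by CSS id, then extract the data point by regex, as
## cthist does for each field listed in Table 1
cells <- html_text(html_elements(html, "#StudyStatusBody td"))
status <- str_match(cells, "Overall Status: ([A-Za-z, ]+)")[, 2]
status <- status[!is.na(status)]
```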

The text for start date and primary completion date is transformed from the original format (Month Day, Year) to ISO-8601 (YYYY-MM-DD). In the case that a date is given only as accurately as the month, it is rounded to the beginning of the month (e.g. "September 2009" is rounded to 2009-09-01) and the start_date_precision or primary_completion_date_precision column indicates that it has been rounded to the month; otherwise the relevant column reports that the date is accurate to the day. The minimum age, maximum age, sex, gender based, and accepts healthy volunteers data points are extracted from the HTML using the CSS selectors and regular expressions indicated in Table 1. Inclusion criteria are extracted as the contents of the cell of the table with CSS id attribute #EligibilityBody that immediately follows the cell containing the label "Criteria:". The lines of text comprising the contents of this cell are encoded as JavaScript Object Notation (JSON) to preserve the data structure. Outcome measures are extracted from rows of the table with CSS id attribute #ProtocolOutcomeMeasuresBody. Each row in the original HTML table that contains an outcome measure is copied to a data frame with three columns: section, label and content. Because section headings are encoded in the original HTML as table rows where the second cell contains no text, the section is the text of the most recently preceding row whose second cell is empty. The label is the text in the first cell of the row, and the content is the text in the second cell. The data frame is encoded as JSON to preserve the data structure.
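The month-rounding behaviour described above can be sketched in a few lines of base R; this is an illustrative re-implementation, not the package's actual code, and it assumes an English locale for month-name parsing:

```r
## Convert a ClinicalTrials.gov date string to ISO-8601, recording
## whether it was only accurate to the month (illustrative sketch;
## assumes an English locale for %B)
parse_ctgov_date <- function(text) {
  day_format <- as.Date(text, format = "%B %d, %Y")
  if (!is.na(day_format)) {
    return(list(date = day_format, precision = "day"))
  }
  ## No day given: round to the first of the month
  month_format <- as.Date(paste("1", text), format = "%d %B %Y")
  list(date = month_format, precision = "month")
}

parse_ctgov_date("September 3, 2009")  ## date 2009-09-03, precision "day"
parse_ctgov_date("September 2009")     ## date 2009-09-01, precision "month"
```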
Contact information is extracted from rows of the table with CSS id attribute #ContactsLocationsBody that come before a row labelled "Locations:". Each row in the original HTML table that contains contact information is copied to a data frame with two columns: label and content. The label is the text in the first cell of the row, and the content is the text in the second cell. This data frame is encoded as JSON to preserve the data structure. Sponsors and collaborators are extracted from rows of the table with CSS id attribute #SponsorCollaboratorsBody. Each row in the original HTML table that contains sponsor or collaborator information is copied to a data frame with two columns: label and content. The label is the text of the first cell of the row, and the content is the text of the second cell. This data frame is encoded as JSON to preserve the data structure. All columns that are encoded as JSON can be converted back to data frames using the fromJSON() function implemented in the R package jsonlite. Upon successful download and parsing of a clinical trial registry history entry, the function returns a named list of these 16 data points. In case of error, the text "Error" is returned so that clinicaltrials_gov_download(), which calls this function, can continue after a failed download while still indicating which rows need to be downloaded again; this aids in downloading large data sets.
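For example, a JSON-encoded cell can be restored to a data frame after download; the JSON string below is invented to mirror the section/label/content structure described above:

```r
library(jsonlite)

## Example JSON as cthist might store it in an outcome measures cell
## (contents invented for illustration)
json <- '[
  {"section": "Primary Outcome Measures:",
   "label": "1. Overall survival",
   "content": "[Time Frame: 24 months]"}
]'

## Convert back to a data frame with columns section, label, content
outcomes <- fromJSON(json)
```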

drks_de_version()

The function drks_de_version() downloads clinical trial data for the DRKS number and version number (starting with 1 as the earliest version) indicated by the function's arguments. The HTML for the historical version page is downloaded using polite and individual data points are extracted using a combination of cascading style sheet (CSS) selectors and regular expressions (regex). A list of 13 data points is returned by this function: recruitment status, start date, closing date, enrolment, enrolment type, minimum age, maximum age, gender, additional inclusion criteria, exclusion criteria, primary outcomes, secondary outcomes and contacts. Because the data provided in the DRKS.de historical version page do not have a one-to-one correspondence with their counterparts on ClinicalTrials.gov, the columns extracted from the two registries differ. See Table 1 for a list of the extracted data points, the CSS selectors that identify the HTML elements in which they are found, and the regular expressions used to extract them. For example, the recruitment status for the version of the trial in question is drawn from the bullet point with CSS selector li.state, taking the text that matches the regular expression Recruitment Status: ([A-Za-z, -]+). Similarly, data points are also extracted using CSS and regex as reported in Table 1 for start date, closing date, enrolment, enrolment type, minimum age, maximum age, and gender. Additional inclusion criteria, exclusion criteria, primary outcomes and secondary outcomes are taken from HTML elements with CSS selectors .inclusionAdd, .exclusion, p.primaryEndpoint and p.secondaryEndpoints respectively, and encoded as JSON to preserve the data structure. Contact information is returned as a data frame with columns label, affiliation, telephone, fax, email and url. Each row in this data frame is populated by one bullet with CSS selector ul.addresses li.address from the original HTML.
The label column is extracted from the HTML. Upon successful download and parsing of a clinical trial registry history entry, the function returns a named list of these 13 data points. In case of error, the text "Error" is returned so that drks_de_download(), which calls this function, can continue after a failed download while still indicating which rows need to be downloaded again; this aids in downloading large data sets.

clinicaltrials_gov_download() and drks_de_download()

The functions clinicaltrials_gov_download() and drks_de_download() loop through the trial registry numbers provided to them in the first argument (NCT numbers or DRKS IDs, respectively), and for each one, download a list of all the dates on which the trial registry entry was updated, using clinicaltrials_gov_dates() or drks_de_dates(), respectively (described above). For each version of each trial, the functions download the clinical trial registry entry version using clinicaltrials_gov_version() or drks_de_version(), respectively. Each downloaded registry entry historical version is written to the filename specified in the second argument to the function, formatted as a CSV. If no filename is specified, the function returns a data frame containing one historical version of a clinical trial per row. In the case of connection failure or server error when downloading a version of a trial record, only the text "Error", as described above, will be reported in the overall status or recruitment status column for that version. If a filename is specified, the function returns TRUE in the case that all rows were downloaded without reporting an error and FALSE in the case that an error was detected. A FALSE return alerts the user to re-run the function with the same arguments; on running the function again, it removes the rows marked "Error" and tries to download them again.
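Because a FALSE return signals that at least one row still contains "Error", a simple retry loop is possible; this sketch assumes `trials` holds real NCT numbers (the ones below are placeholders) and the filename is arbitrary:

```r
library(cthist)

## Hypothetical input: a character vector of NCT numbers
trials <- c("NCT00000000", "NCT00000001")

## Re-run the download until every version has been retrieved without
## error; each pass removes rows marked "Error" and re-downloads them
while (!clinicaltrials_gov_download(trials, "historical_versions.csv")) {
  message("Errors detected; retrying failed rows...")
}
```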

Results

The following is a description of 3 case studies of research questions regarding clinical trial registry entries that are difficult or infeasible to answer without a tool for mass-downloading clinical trial registry history data. Case studies 1 and 3 provide example code for analyzing clinical trials from ClinicalTrials.gov and case study 2 provides example code for DRKS.de; however, any of the three could easily be rewritten for the other clinical trial registry with minor changes.

Case study 1: Assessing change in length to recruitment period

Changes in recruitment length of a clinical trial have been used as one part of a measure of the feasibility of clinical trials [21]. In order to evaluate changes to recruitment period lengths for a cohort of clinical trials, it is necessary to mass-download enrolment data from historical versions of ClinicalTrials.gov records. As described in R Code Box 1, clinical trials meeting the study's inclusion criteria were downloaded by performing a search via the web front-end of ClinicalTrials.gov. Search results were saved as a comma-separated value (CSV) file named SearchResults.csv. The NCT Number column in this file is parsed and used as an argument for the clinicaltrials_gov_download() function implemented in package cthist, which then downloads the records for all the clinical trial registry entries in this sample and saves them as historical_versions_1.csv. The script parses these results and calculates the percentage change in the length of the recruitment period between the first registry entry version posted with an overall status of "Recruiting" and the registry entry version that was active 1 year later.

## R Code Box 1

library(tidyverse)
library(cthist)
library(lubridate)

## To reproduce these methods, conduct a search on the
## ClinicalTrials.gov web front-end and download the result as a
## comma-separated value (CSV) file named `SearchResults.csv`. Save
## this file to your R working directory. This file provides `cthist`
## with the list of NCT numbers for which to download historical trial
## registry data.

## Read the downloaded CSV into memory, and select only the
## `NCT Number` column
trials <- readr::read_csv("SearchResults.csv") %>%
  select(`NCT Number`) %>%
  pull(`NCT Number`)

## Download historical clinical trial data for all NCT numbers in the
## CSV downloaded from ClinicalTrials.gov
clinicaltrials_gov_download(trials, "historical_versions_1.csv")

## Read historical versions from the data in the CSV generated above
hv <- read_csv("historical_versions_1.csv")

## Define follow-up time as 1 year
followup <- lubridate::years(1)

## We will define the start date at launch and completion date at
## launch to be the start date and completion date reported on the
## first version of the trial where that version's overall status is
## listed as "Recruiting"

## To obtain these dates, we will consider only versions with an
## overall status of "Recruiting" and a non-NA start and completion
## date; we will filter for the first row (corresponding to the
## earliest version), and select only the NCT number and the start and
## completion dates, which we will rename `launch_start_date` and
## `launch_completion_date`
launch_dates <- hv %>%
  filter(
    overall_status == "Recruiting" &
      ! is.na(study_start_date) &
      ! is.na(primary_completion_date)
  ) %>%
  group_by(nctid) %>%
  slice_head() %>%
  ungroup() %>%
  select(nctid, study_start_date, primary_completion_date) %>%
  rename(launch_start_date = study_start_date) %>%
  rename(launch_completion_date = primary_completion_date)

## Join the original start dates to every row of each trial and remove
## trials where there was no historical version posted after the trial
## started
hv <- hv %>%
  left_join(launch_dates, by = "nctid") %>%
  filter(! is.na(launch_start_date))

## We will define the start date at follow-up and the completion date
## at follow-up to be the start date and completion date reported on
## the version of the trial that was active at the number of days
## after the start date at launch specified in the `followup` variable

## To obtain these dates, we will consider only versions with a date
## that is less than 1 year after the launch start date and where the
## start and completion dates are not NA; we will filter for the last
## row (corresponding to the latest version), and select only the NCT
## number and the start and completion dates, which we will rename
## `start_date_fup` and `completion_date_fup`
dates_fup <- hv %>%
  filter(
    version_date <= launch_start_date + followup &
      ! is.na(study_start_date) &
      ! is.na(primary_completion_date)
  ) %>%
  group_by(nctid) %>%
  slice_tail() %>%
  ungroup() %>%
  select(nctid, study_start_date, primary_completion_date) %>%
  rename(start_date_fup = study_start_date) %>%
  rename(completion_date_fup = primary_completion_date)

## Join the start and end dates as reported at launch and at follow-up
trial_dates <- launch_dates %>%
  left_join(dates_fup) %>%
  mutate(
    recruitment_length_at_launch =
      launch_completion_date - launch_start_date
  ) %>%
  mutate(
    recruitment_length_at_fup =
      completion_date_fup - start_date_fup
  ) %>%
  mutate(
    recruitment_length_change = paste0(
      round(
        100 * as.numeric(recruitment_length_at_fup) /
          as.numeric(recruitment_length_at_launch) - 100,
        digits = 0
      ),
      "%"
    )
  ) %>%
  select(
    nctid,
    recruitment_length_at_launch,
    recruitment_length_at_fup,
    recruitment_length_change
  )

## Write result as a CSV to disk
trial_dates %>% write_csv("trial_dates.csv")

Nearly identical methods to the above were applied to a cohort of SARS-CoV-2 treatment and prevention efficacy trials that were initiated between 2020-01-01 and 2020-06-30, downloading historical clinical trial versions using cthist [21]. One goal of this study was to assess the feasibility of clinical trials, where part of the definition of feasibility was that an ongoing trial may be considered infeasible if its recruitment period has been extended to at least twice the length originally anticipated in the version of the ClinicalTrials.gov record active at the time of trial start. Assessing a large cohort of clinical trials for changes to recruitment period length would have been labour-intensive, error-prone and difficult to reproduce without the use of an automated tool for downloading clinical trial registry histories.

Case study 2: Identifying changes to outcome measures

Outcome switching in clinical trials is a common practice [22] in which the outcomes that were pre-specified in a clinical trial registry differ from those that are published in the corresponding journal publication. Unreported outcome switching may mislead readers or introduce bias [23]. Because a clinical trial registry entry may be updated at any time, it is necessary to consult not just the most recent version of a clinical trial registry entry, but to review all the versions in the trial registry history in order to determine whether, when, and in what manner the outcomes were changed. The code presented in R Code Box 2 takes a list of DRKS IDs from a CSV downloaded from the web front-end of DRKS.de, downloads all the historical versions of those trials and determines which updates to each clinical trial represent a change in the trial's outcome measures. This script will identify all changes, from ones as major as the malicious switching of primary and secondary outcomes to ones as minor as the correction of a typo, or even a single-character white-space change. The work of determining whether the change should be mentioned in a final journal publication remains to be done by human curation of the script's output data.

## R Code Box 2

library(tidyverse)
library(cthist)

## To reproduce these methods, conduct a search on the DRKS.de web
## front-end and download the result as a comma-separated value (CSV)
## file. The CSV download option on DRKS.de produces a zipped
## semicolon-delimited data file, which must be unzipped and saved to
## your R working directory as `trials.csv` before reading. This file
## provides `cthist` with the list of DRKS IDs for which to download
## historical trial registry data.
trials <- readr::read_delim("trials.csv", ";") %>%
  select(`drksId`) %>%
  pull(`drksId`)

## Download historical clinical trial data for all DRKS IDs in the
## CSV downloaded from DRKS.de
drks_de_download(trials, "historical_versions_2.csv")

## Read historical versions from the data in the CSV generated above
hv <- read_csv("historical_versions_2.csv")

## Identify "run lengths" (the number of rows that contain specified
## columns of equal value) for outcome measures within a trial; this
## "run lengths" object will allow us to number each "run" of outcomes
outcome_runs <- rle(
  paste(hv$drksid, hv$primary_outcomes, hv$secondary_outcomes)
)

## Make an `outcome_run` column that assigns a number to each "run" of
## identical outcomes in the original historical versions data frame
hv <- hv %>%
  mutate(
    outcome_run = rep(
      seq_along(outcome_runs$lengths),
      outcome_runs$lengths
    )
  )

## Group by "runs" of outcomes and select only the first of each. This
## will produce a data frame with one row per "run", indexed by the
## DRKS id and the date on which the outcome measures changed
outcome_changes <- hv %>%
  group_by(outcome_run) %>%
  slice_head() %>%
  ungroup() %>%
  select(drksid, version_date, primary_outcomes, secondary_outcomes)

## Write result as a CSV to disk
outcome_changes %>% write_csv("outcome_changes.csv")

These methods have been applied to an ongoing study to identify changes to a clinical trial's outcomes as reported on ClinicalTrials.gov or DRKS.de at key time points in the course of each trial and after its completion [24]. Changes between versions of a registry entry that are identified automatically by cthist will be assessed manually by human raters to determine whether each represents a meaningful change to the outcomes, such as an added or modified outcome measure, and if so what kind, or a minor cosmetic change (e.g. correcting a typo). While the methods for this project are not fully automated, the project is only feasible because of the use of cthist, which identifies the trials and even the versions that need to be scrutinized by human raters. It would have been impractical to manually access the clinical trial registry histories for all 1897 trials in the study sample and check for changes in outcome measures by hand.

Case study 3: Correcting for variable follow-up time

Let us consider the hypothetical case of a meta-researcher who wishes to characterize phase 3 glioblastoma clinical trial activity in terms of how many clinical trials are stopped (overall status changed to "Terminated", "Suspended" or "Withdrawn"). It may be tempting to search on the ClinicalTrials.gov web front-end for all phase 3 trials with an indication of "glioblastoma" whose overall status is "Terminated", "Suspended" or "Withdrawn", count the results and report them as a fraction of the number of results for the same search without the overall status restriction. As of the writing of this manuscript, 17.2% of phase 3 glioblastoma trials on ClinicalTrials.gov (16 out of 93 on 2022-01-05) had an overall status of "Terminated", "Suspended" or "Withdrawn". This strategy does not account for variable follow-up time among the clinical trials in the sample. A trial that was registered yesterday, for example, may yet go on to be withdrawn if given the same follow-up time as the trials in the sample that were registered five years ago. By failing to account for variable follow-up, our hypothetical researcher is systematically introducing bias into their sample and may produce a misleading count of the number of stopped trials. To correct for this, the overall status of every trial in the sample must be assessed at the same follow-up time; trials where the requisite follow-up time has not yet passed must be excluded from analysis. The script in R Code Box 3 will download historical clinical trial data for the trial numbers specified in the CSV downloaded from a search using the web front-end of ClinicalTrials.gov. Trials with less than five years of follow-up will be removed from the sample, and the script will write a CSV file to disk that includes the overall status of each eligible trial at five years after initial registration.

## R Code Box 3

library(tidyverse)
library(cthist)
library(lubridate)

## To reproduce these methods, conduct a search on the
## ClinicalTrials.gov web front-end and download the result as a
## comma-separated value (CSV) file named `SearchResults.csv`. Save
## this file to your R working directory. This file provides `cthist`
## with the list of NCT numbers for which to download historical trial
## registry data.
trials <- readr::read_csv("SearchResults.csv") %>%
  select(`NCT Number`) %>%
  pull(`NCT Number`)

## Download historical clinical trial data for all NCT numbers in the
## CSV downloaded from ClinicalTrials.gov
clinicaltrials_gov_download(trials, "historical_versions_3.csv")

## Read historical versions from the data in the CSV generated above
hv <- read_csv("historical_versions_3.csv")

## Define follow-up time as 5 years
followup <- lubridate::years(5)

## Make a new column for the first version date for each trial and
## filter for the version of the clinical trial registry that was
## active at 5 years after that date
overall_statuses <- hv %>%
  group_by(nctid) %>%
  mutate(first_version_date = min(version_date)) %>%
  filter(
    as.Date(version_date) <= as.Date(first_version_date) + followup
  ) %>%
  slice_tail() %>%
  ungroup() %>%
  filter(
    first_version_date <= Sys.Date() - followup
  ) %>%
  select(
    nctid, first_version_date, version_date, overall_status
  ) %>%
  mutate(
    stopped = overall_status %in% c(
      "Suspended", "Terminated", "Withdrawn"
    )
  )

## Count result
overall_statuses %>% count(stopped)

## Write result as a CSV to disk
overall_statuses %>% write_csv("overall_statuses.csv")

Among the 93 phase 3 glioblastoma trials on ClinicalTrials.gov as of 2022-01-05, 69 have 5 years of follow-up, and 10 of those (14.4%) had a status of "Terminated", "Suspended" or "Withdrawn" at 5 years.
Among the 6 trials that have overall statuses of “Terminated”, “Suspended” or “Withdrawn” in the most recent version on ClinicalTrials.gov but not at 5 years of follow-up, 4 were originally registered less than 5 years ago, and the remaining 2 changed their status to “Terminated”, “Suspended” or “Withdrawn” after the 5-year mark.
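The exclusion-and-censoring logic above can be illustrated on a small invented example. The column names (`nctid`, `version_date`, `overall_status`) follow those used in R Code Box 3, but the two trial records below are hypothetical, not real registry entries.

```r
library(dplyr)
library(tibble)
library(lubridate)

## Invented version histories for two hypothetical trials
hv <- tribble(
    ~nctid,        ~version_date, ~overall_status,
    "NCT00000001", "2015-03-01",  "Recruiting",
    "NCT00000001", "2021-06-01",  "Terminated", ## changed after 5 years
    "NCT00000002", "2014-01-01",  "Recruiting",
    "NCT00000002", "2016-08-01",  "Withdrawn"   ## changed within 5 years
)

followup <- years(5)

## Keep, for each trial, the last version no later than 5 years after
## first registration
hv %>%
    group_by(nctid) %>%
    mutate(first_version_date = min(version_date)) %>%
    filter(as.Date(version_date) <= as.Date(first_version_date) + followup) %>%
    slice_tail() %>%
    ungroup()
```

At the 5-year mark, the first trial is counted as “Recruiting” (its termination falls outside the follow-up window), while the second is counted as “Withdrawn”, which is precisely the distinction that the naive front-end search fails to make.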

Discussion

The purpose of prospective registration of clinical trials is partly vitiated if there is no efficient means to access historical clinical trial registry data. The responsible party for a clinical trial can effectively “bury” changes to a registry entry in its history if there is no feasible way to access those changes. Peer review of individual clinical trial journal publications does not provide sufficient scrutiny to ensure that a publication is not misleading or biased due to outcome switching and similar practices, as trial registration records are not always thoroughly checked for accuracy [25]. While the disclosure of changes to individual trial registry entries through the registry website provides some level of openness, it only allows researchers to find registry entry changes if they already know where to look, and it limits the feasibility and reproducibility of certain kinds of research. The cthist package provides functions that allow for efficient mass-downloading and processing of historical clinical trial registration data from ClinicalTrials.gov and DRKS.de. This makes certain kinds of meta-research feasible and provides a means to correct common errors in data collection and analysis, such as overlooking variable follow-up. The package also increases the reproducibility of previously completed analyses of clinical trial registry data. In analyses of clinical trial registry entries, it is common practice to report the date on which the clinical trial registry was searched, as the database’s contents change frequently. Without a way to select data from a specific date, as provided by cthist, reproducing this kind of research is difficult or impossible.
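To illustrate the last point, a registry snapshot as of a particular date can be approximated by filtering the downloaded version history to the last version on or before that date. The following is a minimal sketch, not part of the package itself; the filename and search date are hypothetical, and the `nctid` and `version_date` columns are those produced by `clinicaltrials_gov_download()`.

```r
library(dplyr)
library(readr)

## Version history CSV previously produced by
## clinicaltrials_gov_download() (filename here is hypothetical)
hv <- read_csv("historical_versions.csv")

search_date <- as.Date("2020-01-01")  ## the search date to be reproduced

## For each trial, keep the last version on or before the search date;
## trials first registered after that date drop out entirely, as they
## would not have appeared in a search conducted on that date
snapshot <- hv %>%
    filter(as.Date(version_date) <= search_date) %>%
    arrange(nctid, version_date) %>%  ## ensure chronological order
    group_by(nctid) %>%
    slice_tail() %>%
    ungroup()
```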

Limitations

The R package described here is a web-scraper that provides a means of retrieving historical clinical trial data that are not easily available without extensive manual work. Web-scraping a clinical trial registry differs from accessing it through an API in that an API is designed to be queried repeatedly and to receive a large volume of requests from automated programmes, whereas a web-scraping tool opportunistically repurposes a resource that was originally designed only to be accessed manually by individuals through a web browser. This means that ClinicalTrials.gov or DRKS may change their websites at any time in ways that, for example, alter the CSS selectors on which this web-scraper depends [26]. An effort has been made in the design of this tool to respect server requests regarding the volume and frequency of queries by implementing server calls using the polite R package [20]; however, there is a risk that ClinicalTrials.gov or DRKS may implement changes at any time that intentionally or unintentionally break the functionality of this package. Further, not all data points that are available on ClinicalTrials.gov or DRKS.de are collected by cthist, although the package is easily extensible to collect anything reported on the historical version page.

Future directions

The WHO registry network lists 17 primary clinical trial registries other than ClinicalTrials.gov (including DRKS.de) [27]. Future versions of this package may include functions for downloading historical clinical trial registry data from other clinical trial registries that provide access to historical versions through their websites. Future versions may also include functions for downloading additional data points from ClinicalTrials.gov and DRKS.de. This R package may also be integrated into an automated tool that generates reports for reviewers of clinical trials, in partnership with journals that publish clinical trial results. Such a report could include key information on a clinical trial, based on data extracted by cthist, to assist reviewers in their work. A prototype of such an application is available [28]. It is also my hope that the existence of this R package will draw the attention of those who make design decisions for clinical trial registries to the need for this functionality, and that they will implement means for mass-downloading historical clinical trial data for analysis that do not require the use of this R package.

Decision Letter

14 Mar 2022
PONE-D-22-03050
Analysis of clinical trial registry entry histories using the novel R package cthist
PLOS ONE

Dear Dr. Carlisle,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

First of all, I would like to thank all 4 reviewers. They were fast in answering and provided a lot of minor but very constructive points. Please address all these points very carefully. I also want to say that, based on my own review, I think that the paper is sound and that it will be very helpful for meta-researchers involved in this field, as it details a very important R library. I will therefore be very pleased to see a revised manuscript. I have 2 additional comments:
 
- Please add a few words about the main limitations of the library in the abstract;
- Please provide more detail on any plans you have to update the library in the future;
 
My decision is major revision, 1/ because of the large number of minor revisions requested by the reviewers and 2/ because after this first round of peer review, I will send it again to the reviewers. In order to be transparent, please note that I was aware that one of the 4 reviewers had a potential (non-financial) conflict of interest. However, I was quite sure that his input would be very helpful in strengthening the manuscript, and so I asked him to review it. Please note that I did not take his "minor revision" suggestion into account when making my decision. For this reason, I invited 4 reviewers and based my decision only on the 3 remaining reviewers. However, as you will see, inter-reviewer agreement was really good.
Please submit your revised manuscript by Apr 24 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org.

When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Florian Naudet, M.D., M.P.H., Ph.D.
Academic Editor
PLOS ONE

Journal Requirements: When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2.
We noted in your submission details that a portion of your manuscript may have been presented or published elsewhere. [This publication is available as a preprint: https://www.medrxiv.org/content/10.1101/2022.01.20.22269538v1] Please clarify whether this [conference proceeding or publication] was peer-reviewed and formally published. If this work was previously peer-reviewed and published, in the cover letter please provide the reason that this work does not constitute dual publication and should be included in the current manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: N/A
Reviewer #3: N/A
Reviewer #4: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is an interesting paper that presents an R package that performs web scraping to get historical data on clinical trials from two registries: ClinicalTrials.gov and DRKS. The source code of the package is open source on GitHub. Some parts deserve a more elaborate description:

- the ClinicalTrials.gov API is not mentioned at all: from what I understand, the API does not provide the historical data and web crawling is the only way to get it, but still this should be clarified in the paper. The choice to use web crawling techniques has to be justified, as using an API would be a much simpler way to proceed if it were possible
- web scraping is a very useful technique but should be done in a "polite" way, to make sure not to disturb the underlying service / website. The way the package is used depends on the final user of course, but this issue could be brought to the attention of the reader
- The 3 case studies are very interesting applications of the package. But only the third comes with results explaining the benefits of the historical analysis.
At least some light results could also be provided for the first two case studies.
- in Discussion / Future directions: ClinicalTrials.gov and DRKS do provide access to webpages describing the historical changes: that may not be the case for other registries, so this method may not extend as such to any registry.

Reviewer #2: First of all, I would like to thank the editor for the invitation to review this paper and draw my attention to this interesting R package. I think the package is a valuable addition to the R package ecosystem, as there are only very few, if any, packages that allow downloading clinical trial data (some older packages have been archived). Additionally, as the author has correctly pointed out, the usual methods for downloading trial data (e.g., the AACT) do not include the history of changes to a registry entry. The package itself could be described as rather minimalistic but worked well during my testing, and the paper describes the included functions quite well. The case studies illustrate briefly but convincingly how the package enables analyses that would otherwise require considerable manual work. I could successfully run the code for all case studies and briefly tested the Shiny app, which also worked well. However, after reviewing the code and the article, I have some comments.

Comments regarding the package and code:

Comment: When I first tried the clinicaltrials_gov_dates function, it returned some correct dates but many NAs. It turned out that this was due to my language setting, so that only some month names could be read in. I could avoid this error by temporarily changing the locale:

```
lct <- Sys.getlocale("LC_TIME")  # backup
Sys.setlocale("LC_TIME", "C")
```

Maybe the function could do this automatically (and restore the original locale afterwards)? This will otherwise lead to frequent problems.

Comment: The function should probably also warn if it returns NAs.

Comment: I realized the above when running the code line-by-line.
I noticed that the line 'format("%Y-%m-%d")' in clinicaltrials_gov_dates seems to be superfluous, but maybe it is meant as a 'double-check' to make sure the dates are formatted correctly.

Comment: This comment is about both the article and the code. I think it is important to correctly name the returned value of clinicaltrials_gov_version (and the respective function for DRKS). The article often mentions 'lists' or 'data frames' (e.g., line 160), and the help for clinicaltrials_gov_version defines the returned value as a list, although it is actually a character vector.

Comment: In connection with the previous comment, I believe that it might indeed be better to let that function return a list instead of a character vector. First, lists are printed more nicely in the R console, and second, if the elements of the list were named appropriately (with 'outcomes', 'criteria', and so on), we could use $ to subset the returned list. Does returning the result as a character vector have any advantages?

Comment: I have tested the download function for ClinicalTrials.gov with 100 trials for a certain query and also the 93 trials on glioblastoma, as demonstrated in case study 1. However, all values for gender_based were missing. Is this plausible?

Comment: It is good that the author has written unit tests for cthist. However, I think the tests are currently quite basic and could easily test the output more thoroughly. For example, test-clinicaltrials_gov_version.R could test for missing values and correct date format (or even some specific values), instead of only testing the length of the vector.

Comment: It would be nice if ?cthist had a help page with a short overview of the package.

Comment: ClinicalTrials.gov has announced an update to the website. Do you know if this will interfere with the scraper's functionality? Luckily, the beta version of the history page looks identical to the current version at first glance.
Comment: It seems a bit unusual to me that some of the functions (such as the *_dates functions) return "Error" and "Warning" as character vectors upon failure and print the warnings and errors using message(). I assume the intention is not to interrupt longer jobs if only very few tasks fail, but again, this is somewhat unusual because a character vector is the returned data format upon success. Typically, the warning() and stop() functions would be used here. Maybe the functions could have a parameter to let the user choose the behavior? I'm not sure, to be honest.

Comment: The returned data class from the *_dates functions is a character vector, but the help states that it is a date vector (which would be better, in my opinion).

Comment: Are you aware of any limits on automatic requests to ClinicalTrials.gov or DRKS.de? Some sites block web scrapers after a certain amount of requests. I don't think this is the case with these two sites, but it would be good to confirm that. Some users might download a very large number of trials.

Comments regarding the article:

Comment: The case studies are short, but convincing. It would be good to show or mention at least some of the results for every case study instead of leaving this up to the reader.

Comment: Maybe it should be clearly mentioned that 'version 1' is the oldest version of a trial entry. It would be good also to say this on the R help pages.

Comment: The package focuses on the history of registry entries, but I could imagine it being used for the general downloading of current entries. Since the rclinicaltrials package has been archived, there is no package with this functionality on CRAN, as far as I know. It would be helpful to briefly describe how to download only the most current entries. A 'trick' for doing so using slice_tail is included in one of the code examples but could be missed by readers easily.
Comment: Line 114 states "enrolment type (“Anticipated” or “Actual”), which are split into separate columns by the clinicaltrials_gov_download() function". It seems that actually 'Anticipated' or 'Actual' are included after the enrolment value in element 2 and additionally in element 5 of the returned vector. I assume element 2 should be only the numerical value?

Comment: Is there a way to get the text 'cleaned' from line break code etc.? Could this be done by the package, or are there other R packages that are better suited for that? Most users will probably want to clean these entries. A solution could be briefly mentioned.

Comment: I could successfully convert the JSON values that are returned by some of the functions. However, the article mentions in several places that, e.g., the outcome measures are 'data frames', although it seems to me R will convert the JSON to lists (when using rjson). Is this correct? I think returning an actual data.frame as a list element would actually be a more user-friendly solution than the current one, but I assume this was done to keep the output identical to the JSON entries in the CSV generated by the download function. Anyway, the article should be clear that the package does not return data frames, or how the returned values could be converted to data frames.

Comment: Line 147: Maybe mention that 'Percentage change' refers to the percentage change of the length of the recruitment period.

Comment: Some references are missing DOIs. I also think that the citations of R packages should contain a link to CRAN, if applicable.

Reviewer #3: The author presents a new package for the R statistical programming language, aimed at streamlining web scraping of two major clinical trial registries: ClinicalTrials.gov and DRKS.de. This package helps retrieve the change history of registry entries, a function which is not present in these registries' APIs.
It hints at an inclusion in the reviewing of clinical trials, which would indeed be a useful complement to reported protocol changes. In the discussion, the author stresses that the existence of this tool may reveal its additional value, which is an important point. This package is self-contained and usable without disturbance in multiple R paradigms. A version which does not request disk access and is able to operate fully in RAM may be preferred, but it is functional as is.

A) Regarding the article:

A.1) Abstract: I may advise changing "mass downloading" to "scraping", which is more precise while hinting at the volume tolerance of the method. I am also surprised by the expression "the enterprise of human research", which I am not sure is widely used in English. As I am not a native speaker, this is just a question.

A.2) In the introduction, the author writes "Certain research questions [...] are not possible or not feasible to address without a means of accessing and analyzing the history of changes for an entire cohort of clinical trials in a systematic way." I would be pleased to be presented with one or two examples.

A.3) Again in the introduction, the author writes "There is no API access to historical clinical trial registry data for ClinicalTrials.gov or DRKS.de." As there is in fact an API for both of these registries, I would have liked a more precise description of the limitations of these APIs and the need for this web-scraping package. When visiting the ClinicalTrials.gov API webpage, it points to the Clinical Trials Transformation Initiative (CTTI)'s Database for Aggregate Analysis of ClinicalTrials.gov (AACT). Maybe add a sentence to discuss the objective of this database and the limitations motivating the creation of the cthist package?

A.4) In the methods part, the details concerning the *_version functions seem shared between both, and the paragraphs have many repetitions, making reading them slightly laborious. Would integrating the more technical parts into Table 1 make for a more or less concise and reader-friendly presentation? Complementarily, the regexes are mostly simple, but as a "write only" language it may be best to systematically add a translation in plain English. As there are mostly 3 different regexes, they could be substituted in the text by the plain English translation.

A.5) Again in the methods part, I would like more precision on the two following affirmations: "These functions also implement automated error-checking on completion, in order to ensure the accurate retrieval of large sets of clinical trial registry entry histories. If any version of the clinical trials to be downloaded returned an error, these functions return FALSE, otherwise they return TRUE." and "These functions also implement automated checking for already-downloaded data when starting, to allow for re-starting partially completed downloads." I would like further details on what types of errors are checked and to what extent: do the functions only return the website error? Are error sources explored further? etc. For the second, from what I read in the code, it only checks whether data are already present and complete, accelerating a complementary download more than explicitly "allowing" it?

A.6) For Case study 3, even if it is just an example, why not keep all trials, describing status at extraction date as completed, terminated or censored (and status regarding the exceeding of the forecast end date)?

A.7) In the discussion part: As this package is also aimed at researchers not accustomed to web scraping, I would advise adding one or two paragraphs introducing the basics of web scraping: differences between using an API and scraping a page, the risk of abrupt changes and end of service, respect for websites and other users regarding request volume and frequency, risk of IP bans and other countermeasures, etc.
B) Regarding the code examples:

B.1) For all code examples, adding a (succinct) comment on each line of code may be useful for readers not accustomed to the pipe workflow and tidyverse-specific functions.

B.2) This phrase is really hard to read and understand until later in the code, and may benefit from a refactoring centred on objectives, in the style of "extract initial definitions of start and completion date": "## Define the start date at launch and completion date at launch to be ## the start date and completion date reported on the first version of ## the trial where that version's overall status is listed as ## "Recruiting""

B.3) In the pipe `readr::read_csv("SearchResults.csv") %>% select(`NCT Number`) %>% pull()`, using `pull()` without an argument is at risk. `readr::read_csv("SearchResults.csv") %>% pull(`NCT Number`)` should be more reliable.

B.4) The definition of "1 year" being variable in number of days, and `followup <- 365` carrying a risk of error, it may beneficially be replaced by `dmy("21-01-2021") + lubridate::years(1)`, taking care of leap years.

B.5) In code box 2, the use of an external rle object may not be apparent to readers. The author may add a reason for the use of run lengths, or maybe go for an all-in-one approach as in:

```
hv %>%
    mutate(outcome_run = paste(drksid, primary_outcomes, secondary_outcomes) %>%
        {. != lag(.)} %>%
        coalesce(TRUE) %>%
        cumsum()
    ) %>%
    group_by(outcome_run) %>%
    slice_head() %>%
    ungroup()
```

B.6) In code box 3, it is possible to additionally reduce redundancy by using `stopped = overall_status %in% c("Suspended", "Terminated", "Withdrawn")` in line 491.

C) Regarding the package in itself: As a first point, I wish to emphasize the global quality of the package design and the use of dependencies for importing other packages. The following comments should in no way influence the publication decision, as they relate to the package in itself and not the article presenting it.
The three major limitations that I see as of now are:

C.1) The lack of defensive programming in the current state, with no testing of function arguments and thus no meaningful error codes and no catching of errors before handing off to the registry website. As the NCT number format is strictly defined, adding verification that the argument is a character vector of length one corresponding to a defined regex may be a critical addition, ensuring that calls like `clinicaltrials_gov_version("bad_id", "2,8")` or `drks_de_version("DRKS0000000000005219", glm(Species ~ Sepal.Length, data = iris))` produce informative errors.

C.2) For all extracted dates, filling in missing data may be misleading in creating false precision. I would advise either issuing a warning or, preferably, keeping dates as plain text and letting the user deal with imprecision in the day number in the way that is most effective in the specific study context (first day, last day, interval, discard, etc.)

C.3) By the actual design, the *_download() functions require writing disk access, which: 1) may be difficult in some secure analysis environments, and 2) is a deviation from the general R objective of "no side effects". For the cleanest effect, it may be preferable to rewrite the *_download() functions to only batch-call the *_version() functions and let the user choose how to keep their data and control its storage. A simple alternative may be to change the `output_filename` default value to `output_filename = tempfile()`, storing values in RAM, and make the function return the data frame (invisibly if needed) in place of an indication of success or failure.

Additional points that may streamline the package's uptake and usage:

C.4) When screening the code, the *_version() and *_dates() functions seem to be backends for the *_download() functions. If they are not intended to have separate uses, they may be kept as internal functions and only the *_download() functions exported: https://r-pkgs.org/namespace.html . This way there will also be fewer function help-pages to write. In addition, the *_version() functions are the ones actually downloading the information and returning it in R, while the *_download() functions map over the *_version() functions and write to disk, which is slightly misleading.

C.5) If the *_version() functions are kept exposed, adding a default version number value may be useful, for example `versionno = 0`, with defensive programming added to them too.

C.6) The help files may be slightly thin, and adding one of the article's examples may be of great benefit, especially the explanation of how to get a CSV of search results without having to manually copy the individual NCT numbers. If it is simple enough, does the author plan to add a function taking a search string as an argument and returning the NCT numbers?

C.7) To avoid IP bans or an accidental DOS attack, is there a timer between two requests by the *_version() functions? Adding one, setting it arbitrarily large (for example 1 second) and presenting it as an argument would be useful both as a warning against errors and as teaching.

Reviewer #4: Many thanks for the opportunity to review this piece explaining a new R package to examine historic trial information at ClinicalTrials.gov and the DRKS. The purpose of this package is well described and justified. Working with historic registry data is something I have had to do, and write ad hoc code to handle (though in Python for my cases), for a number of my projects. Creating a package that handles at least some aspects of this type of task is certainly welcome and valuable. I have separated my comments into three parts covering my use of the package, the code, and the text of the manuscript. I note some minor points all around.

Comments on Usage: While R is not my primary language, I was able to install the package from CRAN and work through the various functions.
For the author’s reference, I was working in R v4.0.5 within RStudio v1.4.1106 on a Mac running 10.15.17. I cross-checked outputs with a few sample trials from each registry and found no discrepancies. I admit, my lack of experience in R likely limited my ability to play around with the data in more depth, but the outputs are all very simple. It’s either a vector of dates, or a character vector containing strings of the variables in the relevant fields that can be easily outputted to CSV. The more complex fields are organized and tagged logically (in a way similar to a Dictionary in Python).

One note is that when I ran the example function in the documentation for downloading multiple ClinicalTrials.gov records: `clinicaltrials_gov_download(c("NCT02110043", "NCT03281616"), "test_review.csv")` it worked just fine to actually download the data and output the file, but I got a parsing error on each row in which it expected 18 columns and got 19. The output is copied below. This might be unavoidable for some technical reason, but if that is the case, perhaps note this in the documentation so that people aren’t thrown off by the errors during use. The DRKS example did not do this.

```
clinicaltrials_gov_download(c("NCT02110043", "NCT03281616"), "test_review.csv")
2022-03-09 12:49:00 NCT02110043 processed (8 versions, 50%)
2022-03-09 12:49:06 NCT03281616 processed (2 versions, 100%)
Warning: 10 parsing failures.
row col   expected     actual              file
  1  -- 18 columns 19 columns 'test_review.csv'
  2  -- 18 columns 19 columns 'test_review.csv'
  3  -- 18 columns 19 columns 'test_review.csv'
  4  -- 18 columns 19 columns 'test_review.csv'
  5  -- 18 columns 19 columns 'test_review.csv'
... ... .......... .......... .................
See problems(...) for more details.
[1] TRUE
Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.
2: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.
```
Otherwise, I think this is all relatively straightforward, and anyone familiar with working in R should be able to get from these simple functions to workable data relatively easily.

Comments on the Code: Once again, I am slightly limited by my knowledge of R here, but I did look over the code for the package on GitHub. Since, as the author notes, there is no API for either registry that can be used for historical data, this package essentially runs a scraper in the background to gather the relevant data from the history pages. This is noted in the limitations of the paper, but you might also want to include it in your documentation so that people understand that very large queries might cause issues at the host registry. While I'm not in a position to comment on any line-by-line bugs or enhancements, given my own experience scraping historic registry data, I was able to follow how each function broadly works, including how the `download` functions call the other functions, and it all made sense from an implementation perspective. I especially like the feature that re-checks for partial downloads, so as not to waste time and pings of the registry if a scrape is interrupted. I was able to get all the code examples to run and produce the expected results on my own.

Comments on the Text: Since you have a table that matches each extracted field to its CSS selector, I think you can simply point to that rather than also writing it out in words in lines 108-163 and 175-206. If you want to keep details of any context you feel is necessary about the extractions in prose, that is fine, but there is no need to keep repeatedly referencing the CSS selectors and HTML elements used when the table does that much more efficiently for anyone who is interested. Otherwise it really hinders the readability of that section.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?).
If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Eric Jeangirard
Reviewer #2: No
Reviewer #3: Yes: Alexandre Scanff
Reviewer #4: Yes: Nicholas J. DeVito

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

27 Apr 2022

Please see attached file, `Response to Reviewers.docx`

Submitted filename: Response to Reviewers.docx

7 Jun 2022
PONE-D-22-03050R1
Analysis of clinical trial registry entry histories using the novel R package cthist
PLOS ONE

Dear Dr. Carlisle,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I would like to thank the 4 reviewers. I have no additional comment.

Please submit your revised manuscript by Jul 22 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

- A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
- A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
- An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Florian Naudet, M.D., M.P.H., Ph.D.
Academic Editor
PLOS ONE

Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed
Reviewer #3: (No Response)
Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: N/A
Reviewer #3: N/A
Reviewer #4: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository.
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes
Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: The author has made extensive changes to both the paper and the package. Both are in a much-improved state compared to the previous submission. My comments have been addressed sufficiently, and I would like to thank the author for his detailed responses. In my opinion, the article is now acceptable for publication.

Reviewer #3: I would like to thank the author for his responses and clarifications. The context and the workflow benefit from a more in-depth description, and the integration of the "polite" package is a great improvement to the quality of the cthist package. I now have only two small comments:

- In response to the inquiry about a DRKS API: I had found this aggregator service via Clarivate, https://www.cortellislabs.com/page/?api=api-CLI . It may not in fact be sufficient for the intended use, but it may be worth mentioning.
- The code comment referred to in B.2), "## Define the start date at launch [...]", is mentioned as edited but has no tracked change. Has one of the intended changes slipped away between versions?

Thanks again for this new tool in the R environment and for the opportunity to review this package.

Reviewer #4: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Eric Jeangirard
Reviewer #2: No
Reviewer #3: Yes: Alexandre Scanff
Reviewer #4: No

**********
17 Jun 2022

Reviewer #3: I would like to thank the author for his responses and clarifications. The context and the workflow benefit from a more in-depth description, and the integration of the "polite" package is a great improvement to the quality of the cthist package. I now have only two small comments:

- In response to the inquiry about a DRKS API: I had found this aggregator service via Clarivate, https://www.cortellislabs.com/page/?api=api-CLI . It may not in fact be sufficient for the intended use, but it may be worth mentioning.

• A reference to the 3rd-party API has been added to the manuscript. See lines 51-54.

- The code comment referred to in B.2), "## Define the start date at launch [...]", is mentioned as edited but has no tracked change. Has one of the intended changes slipped away between versions?

• A comment beginning “## Define the start date at launch” appears twice in the manuscript. In both cases, a clarifying paragraph has been added afterward. See lines 354-369 and 390-402.

Thanks again for this new tool in the R environment and for the opportunity to review this package.

• Thank you for your review!

Submitted filename: 2022-06-17-rebuttal.docx

20 Jun 2022

Analysis of clinical trial registry entry histories using the novel R package cthist

PONE-D-22-03050R2

Dear Dr. Carlisle,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Florian Naudet, M.D., M.P.H., Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

23 Jun 2022

PONE-D-22-03050R2

Analysis of clinical trial registry entry histories using the novel R package cthist

Dear Dr. Carlisle:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff
on behalf of Pr. Florian Naudet
Academic Editor
PLOS ONE
  8 in total

1.  Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles.

Authors:  An-Wen Chan; Asbjørn Hróbjartsson; Mette T Haahr; Peter C Gøtzsche; Douglas G Altman
Journal:  JAMA       Date:  2004-05-26       Impact factor: 56.272

2.  Changes to registration elements and results in a cohort of Clinicaltrials.gov trials were not reflected in published articles.

Authors:  Shelly Pranić; Ana Marušić
Journal:  J Clin Epidemiol       Date:  2015-07-29       Impact factor: 6.437

3.  Harms of outcome switching in reports of randomised trials: CONSORT perspective.

Authors:  Douglas G Altman; David Moher; Kenneth F Schulz
Journal:  BMJ       Date:  2017-02-14

4.  World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects.

Authors: 
Journal:  JAMA       Date:  2013-11-27       Impact factor: 56.272

5.  How to avoid common problems when using ClinicalTrials.gov in research: 10 issues to consider.

Authors:  Tony Tse; Kevin M Fain; Deborah A Zarin
Journal:  BMJ       Date:  2018-05-25

6.  Prospective registration and reporting of trial number in randomised clinical trials: global cross sectional study of the adoption of ICMJE and Declaration of Helsinki recommendations.

Authors:  Mustafa Al-Durra; Robert P Nolan; Emily Seto; Joseph A Cafazzo
Journal:  BMJ       Date:  2020-04-14

7.  How informative were early SARS-CoV-2 treatment and prevention trials? a longitudinal cohort analysis of trials registered on ClinicalTrials.gov.

Authors:  Nora Hutchinson; Katarzyna Klas; Benjamin G Carlisle; Jonathan Kimmelman; Marcin Waligora
Journal:  PLoS One       Date:  2022-01-21       Impact factor: 3.240

