Methodology to standardize heterogeneous statistical data presentations for combining time-to-event oncologic outcomes.

April E Hebert¹, Usha S Kreaden¹, Ana Yankovsky¹, Dongjing Guo¹, Yang Li¹, Shih-Hao Lee¹, Yuki Liu¹, Angela B Soito², Samira Massachi³, April E Slee⁴.

Abstract

Survival analysis following oncological treatments requires specific analysis techniques to account for data considerations, such as failure to observe the time of event, patient withdrawal, loss to follow-up, and differential follow-up. These techniques can include Kaplan-Meier and Cox proportional hazards analyses. However, studies do not always report overall survival (OS), disease-free survival (DFS), or cancer recurrence using hazard ratios, making the synthesis of such oncologic outcomes difficult. We propose a hierarchical utilization of methods to extract or estimate the hazard ratio to standardize time-to-event outcomes so that study inclusion into meta-analyses can be maximized. We also provide proof-of-concept results from a statistical analysis that compares OS, DFS, and cancer recurrence for robotic surgery to open and non-robotic minimally invasive surgery. In our example, use of the proposed methodology would increase data inclusion from 108 reported hazard ratios to 240 reported or estimated hazard ratios, an increase of 122%. While there are publications summarizing the motivation for these analyses, and comprehensive papers describing strategies to obtain estimates from published time-dependent analyses, we are not aware of a manuscript that describes a prospective framework for an analysis of this scale focusing on the inclusion of a maximum number of publications reporting on long-term oncologic outcomes incorporating various presentations of statistical data.

Year: 2022 PMID: 35202406 PMCID: PMC8870464 DOI: 10.1371/journal.pone.0263661

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Acute short-term outcome trials comparing robotic-assisted surgery with open surgery and non-robotic minimally invasive surgery have been evaluated for decades. Many publications have reported short-term outcomes such as length of stay and complication rates associated with robotic-assisted surgery compared to traditional open surgery and non-robotic minimally invasive surgery [1, 2]. For cancer-related surgical procedures, a comprehensive comparison of robotic-assisted surgery to traditional open or non-robotic minimally invasive alternatives should evaluate mid- to long-term oncologic outcomes in addition to these acute surgical measures. These oncologic outcomes include overall survival (OS), disease-free survival (DFS) and cancer recurrence. In contrast to acute surgical outcomes, where complete patient accounting is feasible, these oncologic outcomes are time-dependent and the event status for each patient is not always known. Both the occurrence of these outcomes and the elapsed time from surgery contribute to valid comparisons across surgical approaches. As such, these outcomes require specific techniques that are not necessary for comparisons of counts or continuous endpoints. Specifically, for short-term outcomes, it is possible to analyze the presence or absence of specific events as binary endpoints and to focus only on the proportion of patients for whom the endpoint has occurred. These analyses assume that the event status is known for each patient, which is reasonable when the outcomes are measured prior to hospital discharge or within a short time after discharge. For time-dependent outcomes such as all-cause mortality, the “endpoint” is the elapsed time from surgery to death. However, it is unlikely that every death among patients in the study would be observed during the follow-up period, so for many patients, the time of death is considered missing data. 
Failure to observe the time of death, or censoring, can occur for reasons other than insufficient observation time. Other common reasons that an event was not observed during a study include patient withdrawal, a competing risk event (an event that precludes observing the outcome of interest), and loss to follow-up. Fortunately, survival analysis techniques can account for failure to observe events and produce valid statistical comparisons. The single event component—time from surgery to death—is replaced by 2 components. For patients with observed deaths (or other events), these components are a binary variable indicating that the death was observed and the time from surgery to death. For patients who did not die prior to loss to follow-up, withdrawal or study conclusion, the corresponding components are a binary variable indicating that death was not observed and the last time that the patient was known to be alive (or that the event of interest had not yet happened). This framework allows adjustment for differences in follow-up time and other reasons that the follow-up data are incomplete. Other cancer-related endpoints such as recurrence and disease-free survival can be defined as similar pairs of indicators of event occurrence and the time the event occurred or the last time when the patient was known to be event-free. For oncologic meta-analyses, the follow-up period may differ considerably across studies, and longer observation time increases the chances of observing an endpoint event. This problem is compounded for retrospective comparisons, where the follow-up period may differ considerably within studies. More time has elapsed since the procedure for older surgical approaches compared to newer ones, resulting in longer amounts of follow-up and more opportunity to observe events. Thus, an analysis based on event counts with no correction for differences in elapsed time will favor the newer technique even if the true event rates are similar. 
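
The two-component representation described above can be sketched in a few lines of Python. This is only an illustration: the `Observation` class, the helper functions, and the eight patients are hypothetical and were invented to show how a crude event proportion can favor the group with shorter follow-up even when its event rate per unit of follow-up time is higher.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    time: float   # years from surgery to the event, or to last contact if censored
    event: bool   # True if the event was observed, False if censored


# Hypothetical cohorts (not study data): the older approach has longer follow-up.
older_approach = [Observation(6.0, True), Observation(7.5, False),
                  Observation(8.0, True), Observation(9.0, False)]
newer_approach = [Observation(1.0, False), Observation(2.0, True),
                  Observation(2.5, False), Observation(3.0, False)]


def crude_event_proportion(observations):
    """Share of patients with an observed event, ignoring follow-up time."""
    return sum(o.event for o in observations) / len(observations)


def events_per_person_year(observations):
    """Observed events divided by total follow-up time."""
    return sum(o.event for o in observations) / sum(o.time for o in observations)


for label, cohort in (("older", older_approach), ("newer", newer_approach)):
    print(label, crude_event_proportion(cohort), round(events_per_person_year(cohort), 3))
```

In this toy example the newer approach has the lower crude proportion but the higher rate per person-year, which is the distortion that time-to-event methods are meant to correct.
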
While statistical techniques such as propensity score matching or covariate adjustment can reduce differences in patient characteristics within studies, these methods do not address differential follow-up. Analysis of the proportion of patients with an event at a specific time-point such as 5-year survival appears to address this problem as follow-up time is fixed across treatment groups (and studies), but this approach does not address censoring and, in general, these comparisons are not amenable to synthesis for long-term outcomes. Analysis of a single time-point can misrepresent the overall treatment effect, and this is especially true when points of maximum or minimum difference between survival curves are selected for presentation [3, 4]. Additionally, the requirement of a common time-point for meta-analyses will reduce the number of studies that can be included, and the generalizability of the findings may suffer as a result. There are several publications further detailing the need for time-dependent analyses [5, 6]. The most common recommendation for the analysis of time-to-event outcomes is to summarize intervention effects using the hazard ratio (HR) [7]. The hazard ratio compares the instantaneous risk of events across interventions. This statistic is most interpretable when the ratio of these risks is relatively constant over the follow-up period (the “proportional hazards assumption”), but it summarizes the overall reduction in event risk for one group compared to another group even when the ratio of risks is not constant over time [8]. Hazard ratios are similar in interpretation to relative risks, but they account for time and censoring in addition to the number of events. On the natural logarithm scale (Ln), the hazard ratio is linear [7, 9] and can be aggregated using the generic inverse-variance methods [8] used in meta-analysis for other statistical measures. Unfortunately, the hazard ratio is not always reported in publications. 
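
As a rough illustration of generic inverse-variance aggregation on the Ln(HR) scale, the sketch below pools study-level hazard ratios with a fixed-effect model. The choice of a fixed-effect model, the function name `pool_fixed_effect`, and the two example studies are assumptions made for illustration; the text above does not specify the pooling model used in the project.

```python
import math


def pool_fixed_effect(studies):
    """Pool (hazard_ratio, se_of_ln_hr) pairs with inverse-variance weights.

    Returns the pooled HR and its 95% confidence limits.
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled_ln_hr = sum(w * math.log(hr) for w, (hr, _) in zip(weights, studies)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    lower = math.exp(pooled_ln_hr - 1.96 * pooled_se)
    upper = math.exp(pooled_ln_hr + 1.96 * pooled_se)
    return math.exp(pooled_ln_hr), lower, upper


# Hypothetical studies: (HR, standard error of Ln(HR))
example_studies = [(0.68, 0.13), (0.68, 0.13)]
pooled_hr, pooled_lower, pooled_upper = pool_fixed_effect(example_studies)
print(round(pooled_hr, 2), round(pooled_lower, 2), round(pooled_upper, 2))
```

Because the weights depend only on the standard error of Ln(HR), any study for which a hazard ratio and its variance can be extracted or estimated can enter the same pooled calculation.
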
Conversely, other publications report multiple hazard ratios calculated from raw data, matched data, and statistical models. Comprehensive papers describing strategies to obtain estimates for time-dependent analyses [4, 8] have been previously published, but these reports lack the organizational framework to implement a large meta-analysis involving multiple teams of data extractors. The goal in our framework development was to maximize the number of included studies while limiting bias, to provide clear guidelines, and improve agreement in dual data extraction of each individual manuscript. Thus, we used a hierarchical decision tree to allow data extractors to identify the most appropriate hazard ratio to extract, or the most appropriate data and method to estimate the hazard ratio when it was not provided directly. The purpose of this manuscript is to describe and illustrate the development of an approach for performing systematic literature reviews and meta-analyses to summarize important time-dependent oncologic endpoints. To illustrate our approach, we describe a project comparing robotic-assisted surgery using the da Vinci surgical system to open and non-robotic minimally invasive surgical procedures (PROSPERO database CRD42021240519, analysis in progress).

Materials and methods

Methods to extract hazard ratios

We created a hierarchical decision tree using 4 methods to extract the hazard ratios and variance estimates from the publications identified for this project. Method 1 used the direct estimate of the hazard ratio when available, while Methods 2–4 used “indirect methods” to estimate the hazard ratio from available data [6]. The decision tree shown in Fig 1 was created based on the increasing number and decreasing plausibility of required assumptions and was used to determine which technique to use to extract or estimate the hazard ratios.

Fig 1

Decision tree for hazard ratio extraction: Flow chart to determine which hazard ratio estimate to use based on data provided in manuscript.

The individual Methods 1–4 are described in detail below. Methods 1 through 3 have previously been described [4, 8, 10] and are recommended by the Cochrane Handbook [7]. Our base assumption was that hazard ratios are a valid comparison of overall risk between groups in directionality and magnitude even when the hazards are not proportional, but statements quantifying the comparisons (e.g., a 5 x higher risk) should not be made in the case of non-proportionality. Our main rules were that 1) all available data, outcome definitions, and stated conclusions were utilized to determine the most valid data, method, and p-value to use, and to check the accuracy of Method 2–4 calculations, 2) when there was a judgement call needed, we selected the method that was the most conservative (most disfavored) for the cohort of interest.

Method 1

Since the goal of this analysis was to combine hazard ratios, reported hazard ratios were used as the first choice when available. Hazard ratios were typically presented with 95% confidence intervals (CI), and so variances could be obtained from confidence intervals [8]:

$$\mathrm{Var}\big(\mathrm{Ln}(HR)\big) = \left[\frac{\mathrm{Ln}(\text{upper CI}) - \mathrm{Ln}(\text{lower CI})}{2\,\Phi^{-1}(1 - \alpha/2)}\right]^{2} \qquad \text{(I)}$$

Where “Ln” denotes the natural logarithm (loge), HR denotes the hazard ratio, CI denotes the confidence interval, Φ-1 is the inverse of the standard normal distribution, and α is the Type 1 error rate. If only the hazard ratio and p-value are reported, the standard error, and from it the confidence interval, can be calculated as follows:

$$\mathrm{SE}\big(\mathrm{Ln}(HR)\big) = \frac{\mathrm{Ln}(HR)}{\Phi^{-1}(1 - p\text{-value}/2)} \qquad \text{(II)}$$

Where “Ln” denotes the natural logarithm (loge), HR denotes the hazard ratio, and Φ-1 is the inverse of the standard normal distribution. The denominator of the ratio in equation II has the form of a z-statistic on the log scale, so equation II is essentially SE = Estimate / z. Note that the standard error cannot be negative, so the calculation should use either Φ-1(p-value/2) or Φ-1(1 − p-value/2) as necessary to make the resulting standard error in equation II positive. There are related formulations of this equation that produce nearly identical results in most circumstances [11]. Once the standard error is obtained, a 95% CI for Ln(HR) is given by:

$$\mathrm{Ln}(HR) \pm 1.96 \times \mathrm{SE}\big(\mathrm{Ln}(HR)\big) \qquad \text{(III)}$$

Where “Ln” denotes the natural logarithm (loge), HR denotes the hazard ratio, and SE denotes the standard error. A confidence interval for HR can be obtained by exponentiating the endpoints:

$$\exp\!\Big(\mathrm{Ln}(HR) \pm 1.96 \times \mathrm{SE}\big(\mathrm{Ln}(HR)\big)\Big) \qquad \text{(IV)}$$

Where “Ln” denotes the natural logarithm (loge) and HR denotes the hazard ratio.

When multiple hazard ratios were reported, the statistical analysis that produced the hazard ratio was also captured (i.e., univariable, multivariable) to follow the extraction priority. It is important to determine an extraction preference a priori for when more than one hazard ratio is reported.
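
The Method 1 conversions described above can be sketched as follows. This is a minimal sketch rather than the authors' implementation (their R code is in the supplementary appendices); the function names are hypothetical, and the inputs are the simulated "reported" values from Table 2 (HR 1.47 [1.14, 1.90], p = 0.0032).

```python
import math
from statistics import NormalDist


def se_from_ci(lower, upper, alpha=0.05):
    """Standard error of Ln(HR) recovered from a (1 - alpha) confidence interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (math.log(upper) - math.log(lower)) / (2 * z)


def se_from_p_value(hr, p_value):
    """Standard error of Ln(HR) recovered from a two-sided p-value (SE = estimate / z)."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return abs(math.log(hr)) / z   # abs() keeps the standard error positive


def ci_from_se(hr, se):
    """95% confidence interval for HR, built on the log scale and exponentiated."""
    ln_hr = math.log(hr)
    return math.exp(ln_hr - 1.96 * se), math.exp(ln_hr + 1.96 * se)


se_ci = se_from_ci(1.14, 1.90)
se_p = se_from_p_value(1.47, 0.0032)
print(round(se_ci, 3), round(se_p, 3), ci_from_se(1.47, se_p))
```

With these inputs the two routes give nearly the same standard error, and the interval rebuilt from the p-value is close to the reported one; small differences reflect rounding of the published values.
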
Our criterion was to prioritize adjusted or matched analyses over unadjusted data and, when both adjusted and matched analyses were available, to maximize group size, because analyses using entire populations account for the relative frequency of case types, severity of disease, surgeon experience, etc., and the results are more generalizable. For these reasons, we prioritized an adjusted HR using the largest sample that adequately addressed confounding (i.e., an adjusted analysis using the whole patient population over a matched patient cohort when matching decreased the sample size). It is also crucial to convert the reference group for all HRs to the same surgical approach before performing a pooled analysis. We converted all HRs so that the robotic cohort was the comparison group (i.e., not the reference group). The conversion was performed by inverting the hazard ratio and its 95% confidence interval: 1/HR [1/upper CI, 1/lower CI], where HR denotes the hazard ratio and CI denotes the 95% confidence interval. To identify the reference group when it was not explicitly stated, the author conclusions, event n, KM curves, and survival rates were used along with the guide shown in Table 1.
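
The reference-group conversion amounts to taking reciprocals and swapping the confidence limits. The helper below is a hypothetical illustration using the simulated values from Table 2, where the HR was reported with the robotic group as the reference.

```python
def invert_hazard_ratio(hr, lower, upper):
    """Switch the reference group: 1/HR with confidence limits [1/upper, 1/lower]."""
    return 1 / hr, 1 / upper, 1 / lower


# Reported with robotic as the reference: HR 1.47 [1.14, 1.90]
converted = invert_hazard_ratio(1.47, 1.14, 1.90)
print([round(value, 2) for value in converted])  # [0.68, 0.53, 0.88]
```

The limits swap because the reciprocal reverses order: the old upper limit becomes the new lower limit.
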

Table 1

Determination of reference group for hazard ratio: Table to assist extractors in identifying reference group, determining need to invert hazard ratios.

Hazard Ratio Direction	Open/Minimally Invasive Reference	Robotic Reference
HR > 1	Robot is Worse	Robot is Better
HR < 1	Robot is Better	Robot is Worse

Method 2

When hazard ratios were not available, the hazard ratio was estimated using log-rank analysis statistics. This method estimates the variance and hazard ratio using the number of patients and events in each arm, and the log-rank p-value [8]:

$$V_r = \frac{O_{Total}\, R_r\, R_c}{(R_r + R_c)^{2}} \qquad \text{(V)}$$

$$\mathrm{Var}\big(\mathrm{Ln}(HR)\big) = \frac{1}{V_r} \qquad \text{(VI)}$$

$$O_r - E_r = \frac{\sqrt{O_{Total}\, R_r\, R_c}}{R_r + R_c} \times \Phi^{-1}\!\left(1 - \frac{p\text{-value}}{2}\right) \qquad \text{(VII)}$$

$$\mathrm{Ln}(HR) = \frac{O_r - E_r}{V_r} \qquad \text{(VIII)}$$

For the above equations, Vr is the inverse variance of the log hazard ratio for the robotic group, OTotal is the total number of events in the robotic plus the comparison group, and Rr and Rc are the number of patients in the robotic and comparison groups, respectively. The “Ln” denotes the natural logarithm (loge) and HR denotes the hazard ratio. The Or and Er are the number of observed and expected events in the robotic group, respectively, Φ-1 is the inverse of the standard normal distribution, and the p-value is assumed to be 2-sided if not otherwise stated and from the log-rank test. Because p-values were assumed to be two-sided, the direction was assigned manually by selecting the appropriate equation from these two options:

$$O_r - E_r = +\frac{\sqrt{O_{Total}\, R_r\, R_c}}{R_r + R_c} \times \Phi^{-1}\!\left(1 - \frac{p\text{-value}}{2}\right) \quad \text{or} \quad O_r - E_r = -\frac{\sqrt{O_{Total}\, R_r\, R_c}}{R_r + R_c} \times \Phi^{-1}\!\left(1 - \frac{p\text{-value}}{2}\right)$$

so that the result is negative when survival in the robotic group was higher/better and positive when survival in the robotic group was lower/worse.

The p-value from the log-rank test of the Kaplan-Meier curves should be nearly identical to the log-rank p-value from an unadjusted (univariable) Cox proportional hazards model; therefore, a Kaplan-Meier log-rank p-value in conjunction with an HR, a KM curve, or a KM survival estimate, was used to estimate the HR with the other Methods 2–4. If the source of the p-value was not clear, we reproduced p-values from the Chi2 and Fisher’s Exact test using cohort totals and event counts for comparison. This step was used to reduce the possibility that either of these tests was the source of the p-value before assuming it was calculated from a log-rank test. When using a p-value to calculate an HR or CI, if it is reported as “less than” (e.g., p<0.001), check for a log-rank statistic and, if one is reported, use it to derive a more precise p-value.
If no log-rank statistic is reported, the convention is to treat it as equal (e.g., p<0.001 becomes p = 0.001). This convention has the effect of biasing the resulting estimates toward the null hypothesis. The number of patients relates to the number of patients included in the analysis (n at risk at time zero) and may not always be equal to the total sample size. For example, it might be necessary to subtract patients with no follow up. For overall survival, the total number of deaths may be calculated by summing across causes of death (e.g., dead of other (DOO) + dead of disease (DOD)), subtracting the number of alive from the total sample size, or by calculating from a proportional death rate. For composite endpoints such as disease-free survival, it is important to be cautious to avoid double-counting patients that experienced multiple events (a patient that experiences recurrence and death would be counted twice if recurrence event n and a death event n were summed), but if there is no evidence it would be inaccurate, DOO + DOD + alive with disease (AWD), or overall mortality + (total recurrence minus DOD) equations could be used.
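
The Method 2 calculation can be sketched in Python as below, mirroring the spreadsheet rows of Table 3. This is an illustrative sketch, not the authors' code; the function name and the `robotic_better` flag (which stands in for the manual direction assignment) are assumptions. The inputs are the simulated values from Table 3.

```python
import math
from statistics import NormalDist


def hr_from_logrank(events_r, events_c, n_r, n_c, p_value, robotic_better):
    """Estimate HR (robotic vs comparison) and 95% CI from event counts and a log-rank p-value."""
    o_total = events_r + events_c
    v_r = o_total * n_r * n_c / (n_r + n_c) ** 2            # inverse variance of Ln(HR)
    z = NormalDist().inv_cdf(1 - p_value / 2)                # two-sided p-value assumed
    direction = -1 if robotic_better else 1                  # manual direction assignment
    o_minus_e = direction * math.sqrt(o_total * n_r * n_c) / (n_r + n_c) * z
    ln_hr = o_minus_e / v_r
    se = math.sqrt(1 / v_r)
    return math.exp(ln_hr), math.exp(ln_hr - 1.96 * se), math.exp(ln_hr + 1.96 * se)


# Table 3 inputs: 105 vs 133 deaths, 300 patients per arm, log-rank p = 0.003
hr, lower, upper = hr_from_logrank(105, 133, 300, 300, 0.003, robotic_better=True)
print(round(hr, 2), round(lower, 2), round(upper, 2))  # 0.68 0.53 0.88
```

As noted in the text, a p-value reported only as "less than" a threshold would be entered at the threshold itself, which pulls the estimate toward the null.
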

Method 3

If neither Method 1 nor Method 2 could be applied based on information available in the publication, but the log-rank p-value and a Kaplan-Meier curve were available, individual patient data were reconstructed from the published Kaplan-Meier curve using the iterative algorithm recommended by Guyot et al. [10, 12, 13]. First, the Kaplan-Meier curve was digitized [14] to retrieve x (time) and y (survival) coordinates. The Guyot algorithm can be implemented using SAS or R; code and examples can be found in S1 Appendix. The Guyot algorithm makes use of the fact that the Kaplan-Meier estimator for time t is the product (∏) of the probability of surviving the most recent interval ending at time t and all intervals prior to time t:

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

Where di and ni represent the number of deaths in interval i and the number at risk at the start of interval i, respectively. Kaplan-Meier estimates only change when an event is observed, so the timing of events is reflected by changes in the curve. The Guyot algorithm divides the follow-up period into intervals, and iteratively adjusts the distribution of censored observations and events until the survival estimate at time t is closest to the result extracted from the graph. The Guyot algorithm was evaluated using a validation exercise that involved simulating data, exporting Kaplan-Meier estimates to be used as input to the Guyot algorithm, reconstructing the individual patient-level data (IPD) using the Guyot algorithm, and comparing the hazard ratio obtained from the original and reconstructed data. This process is shown in Fig 2.
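
To make the product form of the Kaplan-Meier estimator concrete, the sketch below computes it from (time, event) pairs. It illustrates only the estimator that the Guyot algorithm inverts, not the Guyot reconstruction itself; the function name and the five-patient data set are hypothetical.

```python
def kaplan_meier(observations):
    """Return [(event_time, survival)] from (time, event_observed) pairs.

    At each distinct time, survival is multiplied by (1 - deaths / number at risk);
    censored patients leave the risk set without changing the estimate.
    """
    ordered = sorted(observations)
    n_at_risk = len(ordered)
    survival = 1.0
    curve = []
    index = 0
    while index < len(ordered):
        current_time = ordered[index][0]
        at_this_time = [event for time, event in ordered if time == current_time]
        deaths = sum(at_this_time)
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((current_time, survival))
        n_at_risk -= len(at_this_time)
        index += len(at_this_time)
    return curve


# Hypothetical data: deaths at years 1, 3 and 4; censoring at years 2 and 5
toy_data = [(1, True), (2, False), (3, True), (4, True), (5, False)]
print(kaplan_meier(toy_data))
```

The curve steps down only at years 1, 3 and 4, while the censored patient at year 2 changes the size of the later steps by shrinking the risk set. This is why both the drops in a published curve and the numbers at risk carry information about events and censoring.
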

Fig 2

Simulations to verify correct implementation of Guyot algorithm: Visual illustration of simulations explored for validation of Guyot algorithm.

Details can be found in S2 Appendix. This simulation exercise demonstrated that the hazard ratio from the Guyot reconstructed IPD was an accurate estimate of the original hazard ratio used in the simulation. We found the greatest deviations from the original hazard ratio when there were relatively few events (95% censored observations), or no information about n at risk was available. In these cases, published methods provided in Tierney et al. [4] were utilized to obtain or estimate the n at risk. Once the individual patient data had been re-constructed using this algorithm, the number of events in each arm was obtained and the Method 2 estimation procedure was applied. The hazard ratio can also be calculated directly using Cox Proportional Hazards. This was done as part of the simulation study; we calculated the hazard ratio directly from the Guyot IPD and indirectly using Method 2 and compared them. We found no material difference in the resulting hazard ratios, which confirmed that our approach of using the number of events and Method 2 was valid (see results in S2 Appendix). However, similar to what others have observed, both methods are less accurate compared to the simulated value when the proportional hazards assumption does not hold [15]. The use of Method 3 is associated with additional considerations. If the number of events is few enough, manually counting them may be an option, but the result should be confirmed by manually calculating the KM estimate. The quality of the published image affects the ability to accurately count or digitize the Kaplan-Meier curve. There were a few instances where poor resolution and distortion precluded accurate extraction and the use of Method 3, so in those cases, we used Method 4 when possible. If no KM curve is shown, but the time of each event was reported along with summary information about the follow-up distribution, an approximation to the KM curve can be constructed manually. 
As a quality control step, we produced Kaplan-Meier graphs for each study endpoint that utilized Method 3 and compared these graphs to the original published graphs, refining the Guyot input until they matched. The log-rank p-value can be used to adjust censoring in the Guyot algorithm when n at risk over time is not reported. We also used the log-rank p-value to help determine how well the differences in the reconstructed data reflected differences in the published data. Lastly, some manuscripts included 3 cohorts (e.g., open, laparoscopic, and robotic), with a single p-value testing the overall null hypothesis that there are no differences in survival curves [16]. When the overall p-value seemed to reflect pairwise differences (assumption of similar variances), then the 3-way p-value was assumed to be a reasonable approximation of each pairwise comparison. When the overall p-value clearly did not reflect pairwise differences, but the re-constructed data matched the publication relatively well, and the overall p-value from the reconstructed data was similar to the publication, then the reconstructed data were used to calculate pairwise p-values. An additional limitation of this approach is that it may result in type 1 error inflation because the overall comparison is not equivalent to two comparisons to a control. If these criteria were not met, we used Method 4.

Method 4

Method 4 was used in instances when a time-dependent analysis was performed, but it was not possible to account for the censoring distribution because the reported information was limited to a single time point per cohort, either in the form of Kaplan-Meier estimates, or median survival. Method 4a: If Kaplan-Meier estimates for at least one timepoint and a log-rank p-value were reported, the number of events was estimated by multiplying the Kaplan-Meier estimate by the total number of patients to get the number of patients without events, and then subtracting that number from the cohort total to get the number of patients with an event (if failure was presented, no subtraction was performed). We determined a priori which timepoint would be used when several Kaplan-Meier estimates were provided. We preferentially used the latest timepoint. Method 2 was then applied to these estimated event counts. This approach assumes no censoring, that the survival curves do not cross after the estimated time point, and that the hazards are relatively proportional, but it is generally preferable to excluding the study as information about the comparisons across groups will be consistent as long as the hazards are relatively proportional over the study follow-up period and the censoring mechanism is not different across interventions. We also utilized the conclusions of the authors to determine if this approach would accurately reflect the overall comparison between cohorts, and discrepancies were cause for excluding the data if the results and conclusions conflicted and the correct result was unclear. These assumptions may or may not be reasonable and should be mentioned as a limitation when this approach is employed. We strongly recommend a sensitivity analysis excluding Method 4 to understand the possible impact of these assumptions on the overall conclusions. 
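
Method 4a can be sketched by converting a single Kaplan-Meier estimate per cohort into an approximate event count and then applying the Method 2 formulas. The function names are hypothetical, fractional event counts are kept unrounded for simplicity, and the inputs are the simulated 3-year survival estimates from Table 2 (58.6% robotic, 44.5% open, p = 0.003).

```python
import math
from statistics import NormalDist


def events_from_km_estimate(survival_estimate, n_patients):
    """Approximate event count from one survival estimate, assuming no censoring."""
    return n_patients - survival_estimate * n_patients


def hr_from_km_estimates(surv_r, surv_c, n_r, n_c, p_value):
    """Method 2 formulas applied to event counts estimated from survival proportions."""
    events_r = events_from_km_estimate(surv_r, n_r)
    events_c = events_from_km_estimate(surv_c, n_c)
    o_total = events_r + events_c
    v_r = o_total * n_r * n_c / (n_r + n_c) ** 2
    z = NormalDist().inv_cdf(1 - p_value / 2)
    direction = -1 if surv_r > surv_c else 1   # negative when the robotic group did better
    ln_hr = direction * math.sqrt(o_total * n_r * n_c) / (n_r + n_c) * z / v_r
    se = math.sqrt(1 / v_r)
    return math.exp(ln_hr), math.exp(ln_hr - 1.96 * se), math.exp(ln_hr + 1.96 * se)


hr, lower, upper = hr_from_km_estimates(0.586, 0.445, 300, 300, 0.003)
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

The result is close to the Method 4a column of Table 2 (0.71 [0.56, 0.89]). It differs slightly from the Method 2 estimate because the no-censoring assumption inflates the event counts.
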
Method 4b: Median survival is not often reported following curative surgical resection of cancer, especially when follow-up is 5 years or less, as it requires enough deaths to reach 50% survival. However, if this is the only time-dependent analysis reported, it may be possible to estimate an HR using median survival time in each group and the number of events in each group. Using the median survival produces a reasonable estimate when the cohort sizes are similar and when there is a constant event rate. The approximation using median survival should be used with caution as it relies on exponentially distributed events (that is, not only the ratio of event rates but the underlying event rate must be approximately constant) and similar cohort sizes [17]:

$$HR = \frac{\text{median survival}_c}{\text{median survival}_r}, \qquad \mathrm{SE}\big(\mathrm{Ln}(HR)\big) = \sqrt{\frac{1}{O_r} + \frac{1}{O_c}}, \qquad 95\%\ \mathrm{CI} = \exp\!\Big(\mathrm{Ln}(HR) \pm 1.96 \times \mathrm{SE}\big(\mathrm{Ln}(HR)\big)\Big)$$

Where HR is hazard ratio, SE is standard error, “Ln” denotes the natural logarithm (loge), Or and Oc are the number of observed events for robotic and comparison groups, respectively, and CI is the confidence interval.

For papers where a time-to-event analysis was not performed and when we were unable to calculate a hazard ratio estimate using Method 1 through Method 4, dichotomous event data were summarized as a relative risk. In cases where zero event counts precluded estimation of a relative risk, a risk difference was used instead. The relative risks are not adjusted for time and censoring, which is problematic for the reasons described above, and for this reason we did not include them in the meta-analysis of HR.
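
The median-survival approximation of Method 4b can be sketched as follows, assuming exponentially distributed event times so that the hazard in each group is inversely proportional to its median survival. The function name is hypothetical; the inputs are the simulated values from Table 2 (median survival 3.8 versus 2.5 years, 105 versus 133 deaths).

```python
import math


def hr_from_medians(median_r, median_c, events_r, events_c):
    """HR (robotic vs comparison) from median survival times, with a 95% CI.

    Under an exponential model the hazard is ln(2) / median, so the hazard
    ratio reduces to median_c / median_r.
    """
    ln_hr = math.log(median_c / median_r)
    se = math.sqrt(1 / events_r + 1 / events_c)
    return math.exp(ln_hr), math.exp(ln_hr - 1.96 * se), math.exp(ln_hr + 1.96 * se)


hr, lower, upper = hr_from_medians(3.8, 2.5, 105, 133)
print(round(hr, 2), round(lower, 2), round(upper, 2))  # 0.66 0.51 0.85
```

Because only the medians and event counts enter the calculation, this estimate ignores the shape of the survival curves, which is why the text advises caution when the event rate is not approximately constant.
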

Data extraction methods

A data extraction form was created to capture all information related to time-dependent endpoints, with separate tabs for overall survival, disease-free survival, and disease recurrence. This form had 6 categories of data; sample sizes (total number of patients in each cohort, number of patients included in each analysis, n at risk, and number of patients in matched or subgroup comparisons), Method 1 data (HR, CI low, CI high, p-value, statistical test, reference group and evidence for that designation), Method 2 data (the number of events or non-events in each cohort or the event proportions, the p-value and whether it was a log-rank, Cox proportional hazard, or other p-value, and the time point), Method 3 data (the figure # of the Kaplan-Meier curve), Method 4 data (the survival or death Kaplan-Meier estimate, the p-value and statistical test, and the time point of the estimate), and the conclusion of the authors for each comparison and each outcome (e.g., “the robotic cohort showed improved overall survival but no difference in disease-free survival compared to the open cohort”). A condensed version is shown in Table 2.

Table 2

Comparison of hazard ratios Method 1 through 4.

Comparison showing information required for calculation of hazard ratio using various methods and resulting hazard ratio and 95% confidence interval.

	N	Method 1	Method 2	Method 3	Method 4a	Method 4b
Cohort		HR [95% CI]	Deaths	Est. event n	OS 3yr	Median Survival, Deaths
Robotic	300	Ref	105	107	58.6%	3.8 yr, 105
Open	300	1.47 [1.14, 1.90]	133	132	44.5%	2.5 yr, 133
p-value		0.0032	0.003	0.003	0.003
Calculated HR [95% CI]		0.68 [0.53, 0.88]	0.68 [0.53, 0.88]	0.68 [0.53, 0.88]	0.71 [0.56, 0.89]	0.66 [0.51, 0.85]

In addition, for each data type, the location in the paper where the data were found was documented to facilitate the quality control assessment. Statistical methods were also recorded, including any methods used to control for cohort imbalances (modeling with covariate adjustment, propensity score matching or weighting, other). Extractors were trained using a small subset of papers to improve the uniformity of data recording. Data were extracted as reported and all calculations were performed in a separate section of the spreadsheet to maintain an accurate accounting of all data, decisions, and calculations. Dual extraction was used to ensure data quality, and discrepancies were resolved by consensus with at least one additional reviewer [12].

Quality control methods

Although the hierarchy was used to determine which hazard ratio to use for the analysis, when feasible, the estimates using the other methods were calculated for comparison. This strategy allowed for comparison of the magnitude and direction of the hazard ratio under various methods and helped to identify errors and other issues. Even when Method 3 was not used, if a Kaplan-Meier curve was presented and the hazards appeared relatively proportional, comparison of the hazard ratio to this graph at least for directionality was useful. Finally, comparison to the authors’ conclusions was a valuable check to ensure that any assumptions made in Methods 1 through 4 did not lead to an inaccurate representation of the published data. There were cases when the hazard ratios under multiple methods did not align. The most common reasons for discrepancies were related to comparing hazard ratios adjusted for covariates to estimates derived from unadjusted data, or in a few cases, distributions that would probably violate the proportional hazards assumption. We investigated differences across methods until an explanation could be found and used the method that most aligned with our method hierarchy and the conclusions of the authors. When more than one analysis was reported, we selected an adjusted or matched analysis preferentially, and only if no other data were available, unadjusted data; we chose the largest available analysis that adequately addressed cohort differences using the same hierarchy as listed above. When no adjustment or matching was performed, the comparability of the groups was determined by comparing baseline values for a list of covariates potentially related to oncologic outcomes. These two subgroups were analyzed and presented separately as well as combined for a total pooled result to identify any instances where adjustment provided a different result from analyses with comparable groups. 
Due to the differences in the strength and number of assumptions needed for the various methods used, an analysis of the heterogeneity of outcomes based on method type should be performed.

Results

Worked example

We illustrate the four methods described above using a simulated robotic versus open data set and compare the resulting HR estimates (Table 2). For Method 1, the “reported” HR is 1.47 [1.14, 1.90] with the robotic group as the reference. To switch to the open group as the reference, HR = 1/1.47 [1/1.90, 1/1.14] = 0.68 [0.53, 0.88]. Table 3 shows the worked example for Method 2, with the event n as reported in the paper entered in rows 7 (robotic) and 8 (open). The data for rows 7 and 8 can be obtained from one of three sources: directly from the manuscript (Method 2), estimated from the KM curve and Guyot algorithm (Method 3, Fig 3), or calculated from the KM survival estimate at the latest time point (Method 4a) by multiplying the survival estimate by the n at risk to get the number alive, and then subtracting that number from the n at risk to get the estimated number of patients who died. For the simulated data set, KM survival estimates at 3 years were 58.6% robotic versus 44.5% open, so the calculation would be 300 − (58.6% × 300) = 124 for the robotic group and 300 − (44.5% × 300) = 167 for the open group (Fig 3). For Method 4b, median survival and the event n can be used to calculate the HR and CI (Table 4). The R code is provided in S3 Appendix.

Table 3

Worked example using Method 2: Hazard ratio calculated using event counts.

HR = Hazard Ratio, CI = Confidence Interval, eq. = equation, est. = estimated, V_r is the inverse variance of the log hazard ratio for the robotic group, O_total is the total number of events in the robotic plus the comparison group, “Ln” denotes the natural logarithm (log_e), Φ^-1 is the inverse of the standard normal cumulative distribution function, and the p-value is assumed to be 2-sided and from the log-rank test if not otherwise stated. Because p-values were assumed to be two-sided, the direction was assigned manually by multiplying the term by -1 or 1 so that the result was negative when survival in the robotic group was higher/better and positive when survival in the robotic group was lower/worse.

		D	F
		Example with equations	Example with values
	Raw Data	Robotic vs Open	Robotic vs Open
4	R_r (Total number of patients: robotic)	300	300
5	R_c (Total number of patients: control)	300	300
7	O_r (# deaths reported: robotic)	105	105
8	O_c (# deaths reported: control)	133	133
9	Log-Rank p-value (KM or Cox PH)	0.003	0.003
	Calculations
11	Estimated death rate: robotic	= D7/D4	0.35
12	Estimated death rate: control	= D8/D5	0.443
13	Difference in est. death rate (r-c)	= D11-D12	-0.093
14	Direction of difference (enter: 1 if D13 is positive or -1 if D13 is negative)	-1	-1
16	eq. (V): V_r = (O_total*R_r*R_c)/(R_r+R_c)^2	= ((D7+D8)*D4*D5)/((D4+D5)^2)	59.5
17	eq. (VI): Variance Ln(HR) = 1/V_r	= 1/D16	0.0168
18	eq. (VII): O_r − E_r = (√(O_total*R_r*R_c)/(R_r+R_c)) * Φ^-1(1 − p-value/2) * (direction of difference)	= (SQRT((D7+D8)*D4*D5)/(D4+D5))*NORM.S.INV(1-D9/2)*D14	-22.89
19	eq. (VIII): Ln(HR) = (O_r − E_r)/V_r	= D18/D16	-0.385
21	HR = e^Ln(HR)	= EXP(D19)	0.68
22	95% CI Lower = e^(Ln(HR) − 1.96*√(Variance Ln(HR)))	= EXP(D19-1.96*SQRT(D17))	0.53
23	95% CI Upper = e^(Ln(HR) + 1.96*√(Variance Ln(HR)))	= EXP(D19+1.96*SQRT(D17))	0.88
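
For readers who prefer a script to a spreadsheet, the Table 3 calculation (eqs. V to VIII) can be sketched in Python as follows. The function name is hypothetical, and the direction of the effect is inferred here from the crude death rates, mirroring the manual entry in row 14; this is a sketch of the table's arithmetic, not the code in S3 Appendix.

```python
import math
from statistics import NormalDist


def hr_from_events_and_p(r_r, r_c, o_r, o_c, p_value):
    """Method 2: HR and 95% CI from group sizes, event counts, and a 2-sided p-value."""
    o_total = o_r + o_c
    # Row 14: negative when the robotic death rate is lower than the control rate.
    direction = -1 if o_r / r_r < o_c / r_c else 1
    v_r = o_total * r_r * r_c / (r_r + r_c) ** 2                  # eq. (V)
    var_ln_hr = 1 / v_r                                           # eq. (VI)
    z = NormalDist().inv_cdf(1 - p_value / 2)
    o_minus_e = (math.sqrt(o_total * r_r * r_c) / (r_r + r_c)) * z * direction  # eq. (VII)
    ln_hr = o_minus_e / v_r                                       # eq. (VIII)
    half_width = 1.96 * math.sqrt(var_ln_hr)
    return math.exp(ln_hr), math.exp(ln_hr - half_width), math.exp(ln_hr + half_width)


hr, lower, upper = hr_from_events_and_p(300, 300, 105, 133, 0.003)
print(f"HR = {hr:.2f} [{lower:.2f}, {upper:.2f}]")  # prints HR = 0.68 [0.53, 0.88]
```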

Fig 3

Kaplan-Meier curve worked example: Example data for Method 3 Guyot algorithm and worked example.

Panel A is a graph that might appear in a publication. Panel B shows the “digitized” version with time and KM points, and panel C shows the re-constructed individual patient data using the digitization and n at risk as input. See S1 Appendix for full R code.

Table 4

Worked example using Method 4b: Hazard ratio calculated using median survival estimates.

HR = Hazard Ratio, CI = Confidence Interval, eq. = equation, “Ln” denotes the natural logarithm (loge).

		D	F
		Example with equations	Example with values
	Raw Data	Robotic vs Open	Robotic vs Open
4	Total number of patients: robotic	300	300
5	Total number of patients: control	300	300
7	O_r (# deaths reported: robotic)	105	105
8	O_c (# deaths reported: control)	133	133
9	MS_r (Median Survival (in months): robotic)	3.8	3.8
10	MS_c (Median Survival (in months): control)	2.5	2.5
	Calculations
12	eq. (X): HR = MS_c/MS_r	= D10/D9	0.66
13	eq. (XI): Standard Error Ln(HR) = √(1/O_r + 1/O_c)	= SQRT(1/D7+1/D8)	0.13
14	eq. (XII): CI Lower = Ln(HR) − 1.96*SE Ln(HR)	= LN(D12)-1.96*D13	-0.67
15	eq. (XII): CI Upper = Ln(HR) + 1.96*SE Ln(HR)	= LN(D12)+1.96*D13	-0.16
17	Exponentiate CI Lower = e^(Ln(HR) − 1.96*SE Ln(HR))	= EXP(D14)	0.51
18	Exponentiate CI Upper = e^(Ln(HR) + 1.96*SE Ln(HR))	= EXP(D15)	0.85
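
The Table 4 calculation (eqs. X to XII) can likewise be sketched in Python. The function name is hypothetical; the inputs are the simulated values from the table.

```python
import math


def hr_from_medians(ms_r, ms_c, o_r, o_c):
    """Method 4b: HR and 95% CI from median survival times and event counts."""
    hr = ms_c / ms_r                                   # eq. (X)
    se_ln_hr = math.sqrt(1 / o_r + 1 / o_c)            # eq. (XI)
    ln_lower = math.log(hr) - 1.96 * se_ln_hr          # eq. (XII), log scale
    ln_upper = math.log(hr) + 1.96 * se_ln_hr
    return hr, math.exp(ln_lower), math.exp(ln_upper)


hr, lower, upper = hr_from_medians(ms_r=3.8, ms_c=2.5, o_r=105, o_c=133)
print(f"HR = {hr:.2f} [{lower:.2f}, {upper:.2f}]")  # prints HR = 0.66 [0.51, 0.85]
```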

Advantages of method hierarchy

To demonstrate the benefits of a hierarchical approach to hazard ratio extraction or estimation, we calculated the distribution of methods used in our analysis to date. Data from 199 papers were available for inclusion, with some reporting multiple outcomes. When limiting the analysis to papers that used adjustments or matching to account for differences between groups and papers where the groups were comparable, use of Methods 2–4 increased the available HRs from 108 (Method 1) to 240 (Methods 1–4), an increase of 122%. Method 1 was the most used, accounting for about 45% of HRs across the various outcomes. Method 3 was the second most common with 28%, about 15% of hazard ratios were derived using Method 2, and Method 4 was the least common with 12%. The hierarchy of methods led to a dramatic increase in the number of papers that could be included in the analysis compared to restriction to publications reporting hazard ratios. To check the accuracy of including Methods 2–4, we performed simulations. In these simulations, Methods 2 and 4 resulted in hazard ratio estimates that fell within the original confidence interval >99% of the time. However, because it is difficult to simulate non-proportional hazards, truncated p-values, and other challenges in actual published reports, we examined all available publications for overall survival that provided an HR and allowed the use of at least one indirect method for estimating an HR. Comparing reported hazard ratios (Method 1) to HRs estimated using Methods 2–4 within analysis type (i.e., HR and KM both performed on the matched cohort) showed that ≥90% of the estimated HRs fell within the 95% confidence interval of the reported HR (Method 2: 93%, Method 3: 96%, Method 4: 90%). The main reason estimates fell outside of the confidence interval was that truncated p-values were reported (e.g., p<0.001). While this is not a fundamental inaccuracy of the method, it is an issue that arises in meta-analyses. 
Removing papers that report truncated p-values would result in a more precise, but less accurate analysis as it selectively removes papers with highly significant results. This type of check, along with sensitivity analyses based on publication quality, risk of bias, methods used, or number of additional assumptions required, can be useful to find the optimal balance between bias and precision based on the available literature.
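
The effect of a truncated p-value on a Method 2 estimate can be illustrated numerically. Combining eqs. (V), (VII), and (VIII) gives Ln(HR) = direction × Φ^-1(1 − p/2)/√V_r, so entering the truncation bound (0.001) in place of a smaller true p-value pulls the estimate toward 1. In the sketch below the function name is hypothetical, V_r is taken from Table 3, and the "exact" p-value of 0.00001 is an invented value used purely for illustration.

```python
import math
from statistics import NormalDist


def hr_from_p(v_r, p_value, direction=-1):
    """Method 2 point estimate written as exp(direction * z / sqrt(V_r))."""
    z = NormalDist().inv_cdf(1 - p_value / 2)
    return math.exp(direction * z / math.sqrt(v_r))


v_r = 59.5                               # V_r from the Table 3 worked example
hr_truncated = hr_from_p(v_r, 0.001)     # paper reports only "p<0.001"
hr_exact = hr_from_p(v_r, 0.00001)       # hypothetical true p-value
print(f"HR using the bound p=0.001:   {hr_truncated:.2f}")
print(f"HR using the exact p=0.00001: {hr_exact:.2f}")
```

Using the bound is therefore conservative: the estimated effect is smaller than the one the exact p-value would imply.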

Discussion

This manuscript describes a hierarchical ordering of 4 methods for obtaining a hazard ratio or a hazard ratio estimate and illustrates and compares the results for a simulated example. It also provides recommendations for data extraction. Our methods add to the list of previously published techniques for analyzing time-to-event outcomes [4, 5, 8], and include details for implementing our strategies (S1, S3 and S4 Appendices—Guyot code, R functions for HR estimate calculations, and Data extraction & tricks). Finally, this report describes the motivation and assumptions underlying key process decisions. The Cochrane Handbook recommends the methods we have labeled Methods 1–3 as valid ways to extract and pool hazard ratios for time-dependent data [7] and Tierney et al. 2020 [18], prioritized data extraction consistent with our ordering of Methods 1 through 3. They evaluated 18 systematic reviews in oncology with a preference for direct use of the hazards or hazard ratio, followed by estimation of the hazard ratio based on the survival analysis p-value and number of events, followed by extraction from the KM curve (though the extraction technique differed slightly [18]). However, neither of these prior publications provided the level of detail in performing these calculations demonstrated here. In practice, the preparation of data for meta-analysis is time consuming, requires many calculations (possibly with manual adjustments), and consists of a multitude of assumptions to check, caveats to consider, and ad hoc methods for special cases. This paper compiles multiple methods for calculating hazard ratios with clear instructions, a comprehensive worked example, practical tips, additional helpful calculations, and considerations relative to the assumptions. The goal was to provide enough information for others to reproduce this work. The most important potential limitation is that the use of the hazard ratio to summarize time-to-event data may not be ideal. 
While certainly preferable to methods that do not account for time and censoring, the proportional hazards assumption is rarely evaluated in practice. An analysis of 115 trials found that only 4 described efforts to test the proportional hazards assumption [19]. Interestingly, this study also demonstrated that the proportional hazards assumption is more likely to fail when treatments with different mechanisms of action are compared. There are reasonable alternatives to the hazard ratios that are easier to interpret when the assumption of proportional hazards is dubious. These include the Restricted Mean Survival Time, which is the average from time 0 to a specified point [19]. However, this estimate is seldom provided in practice, and its use would likely result in the exclusion of studies presenting only a hazard ratio. Beyond evaluation in individual studies, the proportional hazards assumption is clinically questionable if the surgical approach only impacts short-term surgery related outcomes or the ability to fully excise the cancer. This assumption may be problematic when combining studies from vastly different follow-up durations, and an analysis restricted to longer studies (≥5 years) could be helpful to elucidate any issues when a sufficient number of longer studies become available. Extraction of IPD using the Guyot algorithm allows for investigation of the proportional hazards assumption and alternatives when invalid; we hope that the illustrations in 2 different statistical packages will make this powerful technique more accessible. However, the accuracy of this method depends on the quality of Kaplan-Meier figures and the completeness of the manuscript in reporting n at risk. An analysis of 125 oncology publications found that any post-baseline n at risk information was reported for about half of the manuscripts, and n at risk for at least 4 time-points was reported for about a third [20]. Our findings were similar. 
More journals are starting to require this information, but until this requirement becomes ubiquitous, IPD extraction will be a useful alternative but not a panacea. When the hazard ratio is a reasonable choice, the next consideration is the extraction process. The rationale for developing our process was to maximize the number of papers included in the analysis, while consistently providing the best possible estimate of the hazard ratio based on the data reported. These simultaneous goals necessitate the inclusion of lower quality publications (from a level of evidence/risk of bias standpoint) in the analysis. In addition, the assumptions required to include some papers cannot be assessed using the published results (e.g., the assumption of proportional hazards when no Kaplan-Meier curve was available). Such assumptions, if invalid, could potentially decrease the accuracy of the final pooled result. However, the alternative of limiting meta-analyses of time-dependent outcomes to publications reporting hazard ratios can greatly reduce the number of papers available for analysis, alter the conclusions, and introduce bias. The addition of methods 2–4 allowed us to include manuscripts that did not provide a hazard ratio, and the decision tree allowed us to be systematic in choosing the best possible estimation method when data were available for more than one method. Overall, long-term outcomes were reported in 199 articles in some form. Use of Methods 2–4 increased the available HRs from 108 (Method 1) to 240 HRs (Methods 1–4), facilitating an increase of 122% in the included outcome assessments. The meta-analysis technique assumes that all relevant studies are available (fixed model), or at least a random sample of studies (random model) can be used. 
Failure to include studies that do not present hazard ratios creates a special case of reporting bias; it is possible that the reporting of a hazard ratio correlates with other factors, such as the involvement of a statistician, degree of a priori analysis specification, affluence, centers of excellence, etc. that could cause systemic bias and alter the conclusions. More stringent selection criteria including analysis methods could reduce the generalizability of the conclusions by limiting surgeon experience and patient characteristics. Our decision to maximize studies and patients also mandated inclusion of a wider range of publication types, including retrospective comparative studies. In general, the appropriateness of this decision depends on the clinical context of the project. The inclusion of observational data in meta-analyses is controversial as the heterogeneity and risk of bias within and across studies is much higher than for randomized trials [21]. Concerns about individual and aggregate observational study results are warranted. In our project, some observational studies used techniques such as Cox proportional hazards regression and propensity scores to reduce selection bias and confounding. However, differences in analysis methods and adjustment factors may increase heterogeneity or raise questions about comparability [21]. In our case, the inclusion of observational studies was imperative; of the 199 total manuscripts with time-dependent outcomes identified, only 7 of them were randomized trials and two of the surgical procedures had no RCT representation at all. There are very few randomized controlled trials that have long enough follow up to be able to report on survival outcomes. In light of the paucity of RCTs, the question is how to have confidence in a result based primarily on observational data. 
We are assuming that the benefits of increasing the number of publications outweigh the additional risk of bias that may arise from including non-randomized comparison papers reporting long-term outcomes of interest. Formal assessment of risk of bias is standard in reporting results, and tools are available for both randomized (e.g., Cochrane) and non-randomized (e.g., ROBINS-I, Newcastle-Ottawa) studies. These, or other measures of bias risk along with sensitivity analyses, could be used to further investigate the trade-off between including a broader cross section of the literature and the increased risk of bias. Though less reliable than randomized trials, observational studies have some advantages. Randomized trials are prone to spectrum bias when onerous inclusion and exclusion criteria eliminate large swaths of the patient population, and a protocol-driven, standardized approach to clinical decision-making may differ markedly from real-world practice [22]. Novel surgical approaches are often adopted by a few innovative centers and are not well-represented in national databases (SEER, NCDB) for many years after introduction. Randomization remains the best technique for the removal of systematic imbalances, and whether observational data can produce results with similar validity will continue to be hotly contested (as noted by a recent publication titled, “The magic of randomization versus the myth of real-world evidence” [23]). However, limiting our analysis to papers that performed adjustment or matching or to papers where the cohorts were comparable helps mitigate these issues.
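
The Restricted Mean Survival Time mentioned earlier in this Discussion as an alternative to the hazard ratio is the area under the survival curve from time 0 to a chosen time point. As a minimal sketch, assuming a Kaplan-Meier curve is available as a step function (for example, after IPD reconstruction), it could be computed as below; the function name and the example curve are hypothetical.

```python
def rmst(times, survival, tau):
    """Area under a right-continuous step survival curve from 0 to tau.

    `times` are the sorted event times and `survival` holds the KM estimate
    immediately after each of those times; S(0) is taken to be 1.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, survival):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next drop
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


# Hypothetical curve: survival drops to 0.5 at year 1 and 0.25 at year 2.
print(rmst([1, 2], [0.5, 0.25], tau=3))  # prints 1.75
```

The difference in this quantity between two treatment groups is interpretable as the average survival time gained up to the chosen time point, without requiring proportional hazards.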

Conclusions

Meta-analysis of time-dependent outcomes using hazard ratios can produce a valid synthesis, and the ability to obtain hazard ratio estimates from the available information can aid in this goal. Though meta-analyses pooling hazard ratios are easier to interpret when the proportional hazards assumption holds, the hazard ratio is a valid summary statistic, adjusted for time and censoring, even when this assumption is violated. Providing a practical guide for the implementation of best practices for long-term outcome analysis based on hazard ratios will hopefully reduce the methodological obstacles that currently preclude such analyses. Sensitivity analyses can be added as needed to address issues such as heterogeneity across results. Future studies or technical advancements that streamline the implementation of the Guyot method, and software that automates the calculations based on the formulae described above, would also be useful.

S1 Appendix. Guyot SAS and R-code and examples.

(DOCX)

S2 Appendix. Validation of Guyot algorithm.

(DOCX)

S3 Appendix. R functions for HR estimate calculations.

(DOCX)

S4 Appendix. Assumptions, rules, and tips.

(DOCX)

4 Aug 2021

PONE-D-21-18912

Methodology to Standardize Heterogeneous Statistical Data Presentations for Combining Time-to-Event Oncologic Outcomes

PLOS ONE

Dear Dr. Slee,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 18 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. 
For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols . We look forward to receiving your revised manuscript. Kind regards, Mona Pathak, PhD Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No Reviewer #2: Yes ********** 3. 
Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This interesting manuscript outlines a decision process for selecting effect sizes, and associated precision, from articles (and other reports) of survival (overall, disease-free, and recurrence) where these are not explicitly presented using hazard ratios and CIs/SEs, covering both observational and RCT study designs. In a case study, they demonstrate that many more results (and more studies) could be included in a meta-analysis compared to limiting inclusion to those reporting HRs and CIs or SEs. 
While increasing the number of results able to be included is admirable, as a biostatistician, I would also have significant concerns about the bias versus precision trade-off inherent in this hierarchy. While increasing from 115 HRs to 303, all other things being equal, would be a highly desirable outcome, the additional noise and possible biases from the approximations seems worth more attention here, and not just study quality considerations. Have you performed simulations to identify the magnitude of numerical bias introduced by using these approximations? Can you summarise results from simulation-based articles if not? I appreciated the consideration of this point in the discussion, I’m wondering here about practical advice to the reader about how to approach this trade-off. I was surprised to see no mention of competing risks survival analyses, as far as I could tell, particularly given that I would expect prognostic questions to be much more important here. For example, both disease-free and recurrence outcomes would have other-cause mortality as a competing risk for prognostic research questions. I was expecting that the presence of a competing risk event would be explicitly mentioned at the end of the second paragraph on the second page, along with “patient withdrawal and loss to follow-up.” This also presents a general challenge in systematic reviews and meta-analyses in oncology, and many other areas, where both HRs and SHRs are likely to be reported. I wonder if the authors could incorporate competing risks at appropriate places into their manuscript. As a minor comment, I wonder if explaining the use of Phi in equation I (and subsequent) might not be a kindness to the non-statistical reader (this information is eventually revealed after Equation III), perhaps along with its value for (at least) 95% CIs. Similarly, some readers might appreciate being reminded about alpha here and subsequently. 
I think readers should be very clear about all notation immediately after reading any equation, even when the notation is as standard as it is here. As another minor comment, “univariate” and “multivariate” are generally used by (bio)statisticians to refer to dependent variables, with “univariable” and “multivariable” (referring to independent variables) perhaps being the intention on the second paragraph after Equation II (and elsewhere)? More importantly, in this paragraph, I wonder if readers would benefit from some clearer explanation earlier in the manuscript of what to do when “multiple hazard ratios were reported” in terms of different sets of adjustment variables or when complete case and imputation results are presented, as examples? While RCTs ought not to include confounders, unless there are differential missing data mechanisms, favouring models that include variables used in allocation (stratification or minimisation variables) and competing exposures, while excluding variables potentially on the causal pathway, might be useful advice (perhaps brief with references to further information) here? The advice in “Specifically, the order of preference in this analysis for HRs was: unmatched unadjusted or univariate, unmatched adjusted, unmatched multivariate, matched unadjusted or univariate, matched adjusted, matched multivariate.” and in Figure 1 to favour unadjusted comparisons and those that break matching both go against standard (bio)statistical advice in my view (as a biostatistician) and I think need very careful and well-referenced justification in the body of the manuscript here. “Downgrading” all reported associations to the lowest common denominator, which seems to be the goal here I think, does not strike me as the obvious preference and needs some strong justification in my view. I appreciate that more discussion of this point is in the supplement (S3), but I think this needs to be incorporated into the manuscript, even if only briefly. 
Method 4 seems the most open to challenge, also not being a Cochrane “standard” approach, and as the authors note this make assumptions including there being no censoring (or at least censoring mechanisms that are equivalent across treatments). While I appreciate their comment about including this as a limitation, I wondered if a sensitivity analysis with such studies being removed might not also be an appropriate treatment here. Why is using median survival not explicitly included in Figure 1 given its discussion in the text (this involves two time points rather than one and so seems distinct to me)? I think all four methods would be more likely to be understood by readers, and so used appropriately, if “worked” examples of each were added to the body of the manuscript. Methods I and II should be simple enough here, method III would be nicely illustrated with a figure, and method IV could be usefully shown using both survival through to a fixed timepoint and median survival. I do appreciate the references the authors have included, but I think that the greatest value of this manuscript would be as a guide and a tutorial at the same time, and so, on that basis, I feel that it should be reasonably self-contained (as much as is possible). I appreciate that the supplement (S3) includes much important information about the practice of using these methods, but even there, I’d like to see the essentials moved into the manuscript as I fear many readers will not read the supplements in detail (or will not connect some of the “tips” to the process explained within the body of the manuscript). Your rule to use the most conservative HR, for example, might be useful enough to warrant being included in the manuscript. Similarly, dealing with p-values reported as inequalities, seems rather crucial to me. 
I don’t wish to be snobbish, but the 3D exploded pie chart in Figure 4 seems to me to be a very questionable way of presenting these four values (the information to ink ratio is very low here) and could easily be deleted given that this information is presented in the text. As a final minor comment, I suggest not using “subjects”, as in “when the goal was to maximize the number of included papers and the number of included subjects” (but note at least four other such uses) and instead using “patients” or (if you feel it is appropriate) “participants”. Reviewer #2: Highly appreciated the author’s effort to consolidate methods for calculating hazard ratio, the most appropriate statistic for a meta-analysis of time dependent outcomes, with clear instructions, a comprehensive worked example, practical tips, and additional calculations. Please address the following concerns Query 1: Pages 10 to 12: Use of roman numerals to quote equation number prior to the equation. It may be kept after the equation in the right margin for further use Query 2: P14, Line 6 : While describing Method 3, the authors specified that “When the overall p-value seemed to reflect pairwise differences (assumption of similar variances), then the 3-way p-value was assumed to be a reasonable approximation of each pairwise comparison”. This may increase type 1 error rate due to multiple pairwise comparison. How do you address the difference in type I error rate due to each pairwise comparison?? Query 3: Decision Tree for Hazard Ratio Extraction specified in Figure 1 is remarkable. However the method 4 mentioned by the authors hold the assumption of no censoring; which is difficult to consider in real research studies. How do you justify this? ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. 
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Andrew R Gray Reviewer #2: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 29 Oct 2021 We wish to re-submit the original research article (PONE-D-21-18912) entitled: “Methodology to Standardize Heterogeneous Statistical Data Presentations for Combining Time-to-Event Oncologic Outcomes” for consideration by PLOS ONE. We have addressed the reviewers’ comments and we have uploaded a rebuttal letter that responds to each point raised by the academic editor and reviewer(s) in a file: ‘Response to Reviewers’. This file also outlines the changes to the manuscript, as requested. We are also uploading a tracked changes version of the manuscript ('Revised Manuscript with Track Changes modified’) and a clean version of the manuscript (‘Manuscript modified’) with all of the proposed changes. 
We also updated the Financial Statement to reflect that all authors are employees or consultants of Intuitive Surgical and all work related to this paper was performed in the course of usual work operations and was not based on a grant or any other study-specific funding award as follows: “Funding Statement All authors are employees of Intuitive Surgical, Inc. (AEH, USK, AY, DG, YLi, SL, YLiu) or consult for Intuitive Surgical, Inc. (ABS, SM, AES) and all work related to this paper was performed in the course of usual work operations and was not based on a grant or other study-specific funding award.” We thank the reviewers for their time and their thoughtful suggestions, which we think strengthen the manuscript. PONE-D-21-18912 Methodology to Standardize Heterogeneous Statistical Data Presentations for Combining Time-to-Event Oncologic Outcomes PLOS ONE Response to Reviewers’ Comments: Reviewer #1: This interesting manuscript outlines a decision process for selecting effect sizes, and associated precision, from articles (and other reports) of survival (overall, disease-free, and recurrence) where these are not explicitly presented using hazard ratios and CIs/SEs, covering both observational and RCT study designs. In a case study, they demonstrate that many more results (and more studies) could be included in a meta-analysis compared to limiting inclusion to those reporting HRs and CIs or SEs. Question #1: While increasing the number of results able to be included is admirable, as a biostatistician, I would also have significant concerns about the bias versus precision trade-off inherent in this hierarchy. While increasing from 115 HRs to 303, all other things being equal, would be a highly desirable outcome, the additional noise and possible biases from the approximations seems worth more attention here, and not just study quality considerations. Have you performed simulations to identify the magnitude of numerical bias introduced by using these approximations? 
Can you summarise results from simulation-based articles if not? I appreciated the consideration of this point in the discussion, I’m wondering here about practical advice to the reader about how to approach this trade-off.

Response to Question #1: There are two issues at play: whether these methods work in theory (which can be addressed through simulation studies) and whether they work in practice (with real-world data and reporting issues). To address whether these work in theory, we have performed simulations. We had already run simulations testing Method 3 (see Supplemental Appendix S2) and we utilized the same data set to run simulations for Methods 2 and 4. We found extremely high accuracy for all methods. For Method 3, the results of these simulations are reported in Supplemental Appendix S2 and they show that if the proportion censored is less than 95% and n at risk are provided, the estimated HR is very accurate. This finding matches the simulation studies conducted by Guyot {Guyot P, Ades AE, Ouwens MJNM, Welton NJ. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med Res Methodol. 2012;12:9}. For Method 2, 1300 simulations resulted in a mean of 0.000 SD 0.017 and showed an HR ratio of 1.000 [0.999, 1.001] and for Method 4, 1300 simulations resulted in a mean of 0.001 SD 0.019 and showed an HR ratio of 1.001 [0.999, 1.002]. This is consistent with the asymptotic properties of these estimators {Kalbfleisch, J. D. and Prentice, R. L. The Statistical Analysis of Failure Time Data, Oxford University Press, Oxford, 1980; Tsiatis, A. A. The asymptotic joint distribution of the efficient scores test for the proportional hazards model calculated over time. Biometrika, 68(1), 311-315 (1981)}. Unfortunately, there are many factors that impact the accuracy of these estimates that are difficult to simulate.
Some are issues of reporting, such as truncated curves, truncated p-values, and missing information. Others are related to the shape of the Kaplan-Meier curves themselves (deviations from proportional hazards). Therefore, to address whether these work in practice, we performed an accuracy check of the HRs estimated using Methods 2-4 compared to the HRs reported by the authors for one outcome. Papers that reported an HR and sufficient data to estimate at least one additional HR using Methods 2-4 were selected. Analyses based on similar data (ie. matched Method 1 HR vs. matched Method 2 HR) were compared by determining the number of estimated HRs that fell within the 95% confidence interval of the reported HR. Note that this is not an entirely unbiased assessment of error because most manuscripts that presented proportional hazards models did so because the authors were suspicious about residual confounding. (please see figure provided in 'response to reviews' file) Even in these suboptimal conditions (because the presence of an HR implies the need to adjust for confounders), ≥ 90% of the estimated HRs fell within the 95% confidence interval of the reported HR (Method 2: 93%, Method 3: 96%, Method 4: 90%). The main reason papers fell outside of the confidence interval was because truncated p-values were reported (e.g. p<0.001). While this is not a fundamental inaccuracy of the method, it is an issue that arises in meta-analyses. Removing papers that report truncated p-values would result in a more precise, but less accurate analysis as it selectively removes papers with highly significant results. In addition, a larger study assessed the accuracy of the 4 most popular methods for extracting data from Kaplan-Meier curves (Guyot, Parmar, Williamson, Hoyle and Henley) (our Method 3) using a similar technique and including nearly 200 published Kaplan-Meier curve/hazard ratio pairs from regulatory submissions of cancer treatments. 
This study found that the Guyot technique produced the most accurate estimates and the smallest bias of the four techniques. The error rate on the hazard scale was immaterial (HR: 1.0094 and 95% CI 0.998 to 1.020) {Saluja R, Cheng S, Delos Santos KA, Chan KKW. Estimating hazard ratios from published Kaplan-Meier survival curves: A methods validation study. Res Synth Methods. 2019 Sep;10(3):465-475. doi: 10.1002/jrsm.1362. Epub 2019 Jun 24. PMID: 31134735}. We also agree that it is important to make the reader aware of these issues. We have added a recommendation to perform sensitivity analyses based on method and risk of bias or quality assessments and have made the following changes to the manuscript:

Methods, p-value section: “When using a p-value to calculate an HR or CI, if it is reported as “less than” (e.g., p<0.001), check for a log-rank statistic and, if reported, use it to derive a more precise p-value. If no log-rank statistic is reported, the convention is to treat it as equal (e.g., p<0.001 becomes p=0.001). This convention has the effect of biasing the resulting estimates toward the null hypothesis.”

Method 4 methods: “We strongly recommend a sensitivity analysis excluding Method 4 to understand the possible impact of these assumptions on the overall conclusions.”

Advantages of Method Hierarchy: “To check the accuracy of including Methods 2-4, we performed simulations. In these simulations, Methods 2 and 4 resulted in hazard ratio estimates that fell within the original confidence interval >99% of the time. However, because it is difficult to simulate non-proportional hazards, truncated p-values, and other challenges in actual published reports, we examined all available publications for overall survival that provided an HR and allowed the use of at least one indirect method for estimating an HR. Comparing reported hazard ratios (Method 1) to estimated HR using Methods 2-4 within analysis type (ie.
HR and KM both performed on matched cohort), showed that ≥90% of the estimated HRs fell within the 95% confidence interval of the reported HR (Method 2: 93%, Method 3: 96%, Method 4: 90%). The main reason papers fell outside of the confidence interval was that truncated p-values were reported (e.g. p<0.001). While this is not a fundamental inaccuracy of the method, it is an issue that arises in meta-analyses. Removing papers that report truncated p-values would result in a more precise, but less accurate analysis as it selectively removes papers with highly significant results. This type of check, along with sensitivity analyses based on publication quality, risk of bias, methods used, or number of additional assumptions required, can be useful to find the optimal balance between bias and precision based on the available literature.” Question #2 I was surprised to see no mention of competing risks survival analyses, as far as I could tell, particularly given that I would expect prognostic questions to be much more important here. For example, both disease-free and recurrence outcomes would have other-cause mortality as a competing risk for prognostic research questions. I was expecting that the presence of a competing risk event would be explicitly mentioned at the end of the second paragraph on the second page, along with “patient withdrawal and loss to follow-up.” This also presents a general challenge in systematic reviews and meta-analyses in oncology, and many other areas, where both HRs and SHRs are likely to be reported. I wonder if the authors could incorporate competing risks at appropriate places into their manuscript. Response to Question #2: This is a good point. 
To address this, we have added the following statement to the manuscript as marked with underlining: Introduction: “Other common reasons that an event was not observed during a study include patient withdrawal, a competing risk event (an event that precludes observing the outcome of interest), and loss to follow-up.” Question #3 As a minor comment, I wonder if explaining the use of Phi in equation I (and subsequent) might not be a kindness to the non-statistical reader (this information is eventually revealed after Equation III), perhaps along with its value for (at least) 95% CIs. Similarly, some readers might appreciate being reminded about alpha here and subsequently. I think readers should be very clear about all notation immediately after reading any equation, even when the notation is as standard as it is here. Response to Question #3: Thank you for the suggestion. To address this, we have added details to the description of each equation so that all symbols are identified and defined. Question #4 As another minor comment, “univariate” and “multivariate” are generally used by (bio)statisticians to refer to dependent variables, with “univariable” and “multivariable” (referring to independent variables) perhaps being the intention on the second paragraph after Equation II (and elsewhere)? Response to Question #4: The preferred wording has been added in place of “univariate” and “multivariate” in all places where they are used in the manuscript. Question #5 More importantly, in this paragraph, I wonder if readers would benefit from some clearer explanation earlier in the manuscript of what to do when “multiple hazard ratios were reported” in terms of different sets of adjustment variables or when complete case and imputation results are presented, as examples? 
While RCTs ought not to include confounders, unless there are differential missing data mechanisms, favouring models that include variables used in allocation (stratification or minimisation variables) and competing exposures, while excluding variables potentially on the causal pathway, might be useful advice (perhaps brief with references to further information) here? The advice in “Specifically, the order of preference in this analysis for HRs was: unmatched unadjusted or univariate, unmatched adjusted, unmatched multivariate, matched unadjusted or univariate, matched adjusted, matched multivariate.” and in Figure 1 to favour unadjusted comparisons and those that break matching both go against standard (bio)statistical advice in my view (as a biostatistician) and I think need very careful and well-referenced justification in the body of the manuscript here. “Downgrading” all reported associations to the lowest common denominator, which seems to be the goal here I think, does not strike me as the obvious preference and needs some strong justification in my view. I appreciate that more discussion of this point is in the supplement (S3), but I think this needs to be incorporated into the manuscript, even if only briefly. Response to Question #5: After further consideration, we agree with the reviewer. We have made changes throughout the manuscript to reflect that papers and analyses that have adequately controlled confounding should be prioritized. All numbers and % were updated to reflect the change to limiting the analysis to adjusted/balanced data only. 
The changes to the manuscript are as follows: Abstract: “In our example, use of the proposed methodology would allow for the increase in data inclusion from 108 hazard ratios reported to 240 hazard ratios reported or estimated, resulting in an increase of 122%.” Introduction: “The goal in our framework development was to maximize the number of included studies while limiting bias, to provide clear guidelines, and improve agreement in dual data extraction of each individual manuscript.” Method 1 methods: “It is important to determine an extraction preference a priori for when more than one hazard ratio is reported. Our criterion was to maximize group size because analyses using entire populations account for the relative frequency of case types, severity of disease, surgeon experience, etc. and the results are more generalizable. For these reasons, we prioritized an adjusted HR using the largest sample that adequately addresses confounding (ie. whole patient population over a matched patient cohort when matching decreased the sample size).” Quality Control Methods: “When more than one analysis was reported, we selected the largest available analysis that adequately addressed cohort differences using the same hierarchy as listed above. When no adjustment or matching was performed, the comparability of the groups was determined by comparing baseline values for a list of covariates potentially related to oncologic outcomes. These two subgroups were analysed and presented separately as well as combined for a total pooled result to identify any instances where adjustment provided a different result from analyses with comparable groups.” Quality Control Methods: deleted the following section: “A comparison of the results restricted to the “matched/balanced” studies to the overall results allowed us to assess the impact of selection bias and other kinds of confounding. 
Alternatively, when the goal was to maximize the number of included papers and the number of included patients, the selection differed slightly. For example, when hazard ratios were not presented, the available information was usually based on the unadjusted empirical estimates of the cumulative distribution function (Kaplan-Meier (KM) estimates). This information is most similar to univariable proportional hazards models with no covariate adjustment, so when multiple HR estimates were presented, the univariable ones were used. Specifically, the order of preference in this analysis for HRs was: unmatched unadjusted or univariable, unmatched adjusted, unmatched multivariable, matched unadjusted or univariable, matched adjusted, matched multivariable.” Advantages of Method Hierarchy: “When limiting the analysis to papers that used adjustments or matching to account for differences between groups and papers where the groups were comparable, use of Methods 2-4 increased the available HRs from 108 (Method 1) to 240 HRs (Methods 1-4), facilitating an increase of 122%. Method 1 was the most commonly used, accounting for about 45% of HRs across the various outcomes. About 15% of hazard ratios were derived using Method 2. Method 3 was the second most common with 28%, and Method 4 was the least common with 12%.” Discussion: “In light of the paucity of RCTs, the question is how to have confidence in a result based primarily on observational data. We are assuming that the benefits of increasing the number of publications outweigh the additional risk of bias that may arise from including non-randomized comparison papers reporting long-term outcomes of interest. Formal assessment of risk of bias is standard in reporting results, and tools are available for both randomized (e.g. Cochrane) and non-randomized (e.g. Robins-I, Ottawa Newcastle) studies. 
These, or other measures of bias risk along with sensitivity analyses, could be used to further investigate the trade-off between including a broader cross section of the literature and the increased risk of bias.” Discussion: “Randomization remains the best technique for the removal of systemic imbalances, and whether observational data can produce results with similar validity will continue to be hotly contested (as noted by a recent publication titled, “The magic of randomization versus the myth of real-world evidence”(24)). However, limiting our analysis to papers that performed adjustment or matching or to papers where the cohorts were comparable helps mitigate these issues.” Figure 1: changed order of HR preference in the decision tree to: adjusted, then matched, then unadjusted/unmatched. Question #6: Method 4 seems the most open to challenge, also not being a Cochrane “standard” approach, and as the authors note this make assumptions including there being no censoring (or at least censoring mechanisms that are equivalent across treatments). While I appreciate their comment about including this as a limitation, I wondered if a sensitivity analysis with such studies being removed might not also be an appropriate treatment here. Why is using median survival not explicitly included in Figure 1 given its discussion in the text (this involves two time points rather than one and so seems distinct to me)? Response to Question #6: We agree that Method 4 is the most open to challenge and we agree that a sensitivity analysis excluding this method could be important, especially if strong statements about the pooled results will be made. 
Therefore, we have added the following statement to the Method 4 Methods section of the manuscript: “We strongly recommend a sensitivity analysis excluding Method 4 to understand the possible impact of these assumptions on the overall conclusions.” We also performed an accuracy check for the indirect methods, including Method 4, as mentioned above in our response to Question #1.

There were very few instances where median survival was the only data provided in our literature, and even though there are two different time points, there is still only one time point per cohort and this method also assumes no censoring, so we felt that it was appropriate to group it with Method 4. We added this to the manuscript as follows: “Method 4 was used in instances when a time-dependent analysis was performed, but it was not possible to account for the censoring distribution because the reported information was limited to a single time point per cohort, either in the form of Kaplan-Meier estimates, or median survival.” We are aware that median survival is more commonly reported in other clinical settings, so we have changed it to Method 4b and added a separate mention to Figure 1 to read: “Are number of events, Kaplan-Meier estimates for median survival available? -> Method 4b: Assume no censoring, constant event rate. Use median survival in each group to estimate HR.”

We have also added the equations for calculating the HR and confidence interval from median survival to the manuscript as follows: (please see new equations in 'response to reviewers' file) “Where HR is hazard ratio, SE is standard error, “Ln” denotes the natural logarithm (log_e), Or and Oc are the number of observed events for robotic and comparison groups, respectively, and CI is confidence interval.”

Question #7 I think all four methods would be more likely to be understood by readers, and so used appropriately, if “worked” examples of each were added to the body of the manuscript.
Methods I and II should be simple enough here, method III would be nicely illustrated with a figure, and method IV could be usefully shown using both survival through to a fixed timepoint and median survival. I do appreciate the references the authors have included, but I think that the greatest value of this manuscript would be as a guide and a tutorial at the same time, and so, on that basis, I feel that it should be reasonably self-contained (as much as is possible). I appreciate that the supplement (S3) includes much important information about the practice of using these methods, but even there, I’d like to see the essentials moved into the manuscript as I fear many readers will not read the supplements in detail (or will not connect some of the “tips” to the process explained within the body of the manuscript). Your rule to use the most conservative HR, for example, might be useful enough to warrant being included in the manuscript. Similarly, dealing with p-values reported as inequalities, seems rather crucial to me. Response to Question #7: We have made sure that all of the assumptions and tips listed in Supplemental Appendix S3 have been added to the main body of the manuscript, except for those specific to biochemical recurrence in prostate cancer as follows (additions marked with underlining): Methods to Extract Hazard Ratios: “The individual Methods 1-4 are described in detail below. Methods 1 through 3 have previously been described(4, 7, 10) and are recommended by the Cochrane Handbook(8). Our base assumption was that hazard ratios are a valid comparison of overall risk between groups in directionality and magnitude even when the hazards are not proportional, but statements quantifying the comparisons (e.g., a 5 x higher risk) should not be made in the case of non-proportionality. 
Our main rules were that 1) all available data, outcome definitions, and stated conclusions were utilized to determine the most valid data, method, and p-value to use, and to check the accuracy of Method 2-4 calculations, 2) when there was a judgement call needed, we selected the method that was the most conservative (most disfavored) for the cohort of interest.” Method 2: “For overall survival, the total number of deaths may be calculated by summing across causes of death (e.g., dead of other (DOO) + dead of disease (DOD)), subtracting the number of alive from the total sample size, or by calculating from a proportional death rate. For composite endpoints such as disease-free survival, it is important to be cautious to avoid double-counting patients that experienced multiple events (a patient that experiences recurrence and death would be counted twice if recurrence event n and a death event n were summed), but if there is no evidence it would be inaccurate, DOO + DOD + alive with disease (AWD) or overall mortality + (total recurrence minus DOD) equations could be used.” Method 3: “The use of Method 3 is associated with additional considerations. If the number of events is few enough, manually counting them may be an option, but the result should be confirmed by manually calculating the KM estimate. However, the quality of the published image affects the ability to accurately count or digitize the Kaplan-Meier curve.” “If no KM curve is shown, but the time of each event was reported along with summary information about the follow-up distribution, an approximation to the KM curve can be constructed manually.” “The Log-Rank p-value can be used to adjust censoring in the Guyot algorithm when n at risk over time is not reported.” Method 4: “We determined a priori which timepoint would be used when several Kaplan-Meier estimates were provided. We preferentially used the latest timepoint. Method 2 was then applied to these estimated event counts. 
This approach assumes no censoring, that the survival curves do not cross after the estimated time point, and that the hazards are relatively proportional…” “We also utilized the conclusions of the authors to determine if this approach would accurately reflect the overall comparison between cohorts, and discrepancies were cause for excluding the data if the results and conclusions conflicted and the correct result was unclear.” “Using the median survival produces a reasonable estimate when the cohort sizes are similar and when there is a constant event rate.”

Note that the key difference between Method 2 and Methods 3 and 4a is that in Method 2, the number of events is explicitly reported, while in Methods 3 and 4a, the number of events is estimated. Thus, we have added a table showing the calculation for Method 2 in full detail and we have added a description in a figure of how to estimate the number of events from a Kaplan-Meier curve using the Guyot algorithm. We have also illustrated in the text how to estimate the number of events using the Kaplan-Meier probability; these estimated event numbers can be plugged into the Method 2 calculator to obtain Method 3 and 4a HR estimates. We have also added a table illustrating how to derive an HR estimate from the median survival times. We hope that these additions will satisfy the request for “worked” examples. We have added the above content in an additional section in the manuscript titled “Worked Example”, with corresponding tables and figures, as follows:

“Worked Example: We illustrate the four methods described above based on a simulated robotic versus open data set and we compare the resulting HR estimates (Table 2). For Method 1 the “reported” HR is: 1.47 [1.14, 1.9] with the robotic group as the reference. To switch to the open group as the reference, HR=1/1.47 [1/1.9, 1/1.14] = 0.68 [0.53, 0.88].
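The reference-group switch in the Method 1 example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `switch_reference` is hypothetical and not part of the manuscript. It takes the reciprocal of the HR and swaps the confidence bounds, because 1/x is a decreasing function for positive x.

```python
def switch_reference(hr, ci_lower, ci_upper):
    """Re-express a hazard ratio and its CI with the other group as reference.

    The reciprocal of the upper bound becomes the new lower bound and vice
    versa, because 1/x is decreasing for x > 0.
    """
    return 1 / hr, 1 / ci_upper, 1 / ci_lower


# Values from the Method 1 worked example: 1.47 [1.14, 1.9], robotic as reference.
hr, lower, upper = switch_reference(1.47, 1.14, 1.9)
print(f"{hr:.2f} [{lower:.2f}, {upper:.2f}]")  # 0.68 [0.53, 0.88]
```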
Table 3 shows the worked example for Method 2, with the event n as reported in the paper entered in rows 7 (robotic) and 8 (open). The data for rows 7 and 8 can be obtained from one of three sources: directly from the manuscript (Method 2), estimated from the KM curve using the Guyot algorithm (Method 3, Figure 4), or calculated from the KM survival estimate at the latest time point (Method 4a) by multiplying the survival estimate by the n at risk to get the number alive, and then subtracting that from the n at risk to get the estimated number of patients who died. For the simulated data set, KM survival estimates at 3 years were 58.6% robotic versus 44.5% open, so the calculation would be 300-(58.6% x 300)=124 for the robotic group and 300-(44.5% x 300)=167 for the open group (Figure 3). For Method 4b, median survival and the event n can be used to calculate the HR and CI (Table 4).”

Table 3: Worked Example using Method 2: Hazard ratio calculated using event counts

| Row | Item | Column D: Example with equations (Robotic vs Open) | Column F: Example with values (Robotic vs Open) |
|---|---|---|---|
| | Raw Data | | |
| 4 | Rr (Total number of patients: robotic) | 300 | 300 |
| 5 | Rc (Total number of patients: control) | 300 | 300 |
| 7 | Or (# deaths reported: robotic) | 105 | 105 |
| 8 | Oc (# deaths reported: control) | 133 | 133 |
| 9 | Log-Rank p-value (KM or Cox PH) | 0.003 | 0.003 |
| | Calculations | | |
| 11 | Estimated death rate: robotic | =D7/D4 | 0.35 |
| 12 | Estimated death rate: control | =D8/D5 | 0.443 |
| 13 | Difference in est. death rate (r-c) | =D11-D12 | -0.093 |
| 14 | Direction of difference (enter: 1 if D13 is positive or -1 if D13 is negative) | -1 | -1 |
| 16 | eq. (V) = Vr = (Ototal*Rr*Rc)/(Rr+Rc)^2 | =(((D7+D8)*D4*D5)/((D4+D5)^2)) | 59.5 |
| 17 | eq. (VI) = Variance Ln(HR) = 1/Vr | =1/D16 | 0.0168 |
| 18 | eq. (VII) = Or - Er = (√(Ototal*Rr*Rc)/(Rr+Rc))*Φ^-1(1-p-value/2)*(direction of difference) | =(SQRT((D7+D8)*D4*D5)/(D4+D5))*(NORM.S.INV(1-D9/2)*D14) | -22.89 |
| 19 | eq. (VIII) = Ln(HR) = (Or-Er)/Vr | =D18/D16 | -0.385 |
| 21 | HR = e^(Ln(HR)) | =EXP(D19) | 0.68 |
| 22 | 95% CI Lower = e^(Ln(HR)-1.96*√(variance Ln(HR))) | =EXP(D19-1.96*SQRT(D17)) | 0.53 |
| 23 | 95% CI Upper = e^(Ln(HR)+1.96*√(variance Ln(HR))) | =EXP(D19+1.96*SQRT(D17)) | 0.88 |

HR=Hazard Ratio, CI=Confidence Interval, eq. = equation, est. = estimated

Table 4: Worked Example using Method 4b: Hazard Ratio calculated using median survival estimates

| Row | Item | Column D: Example with equations (Robotic vs Open) | Column F: Example with values (Robotic vs Open) |
|---|---|---|---|
| | Raw Data | | |
| 4 | Total number of patients: robotic | 300 | 300 |
| 5 | Total number of patients: control | 300 | 300 |
| 7 | Or (# deaths reported: robotic) | 105 | 105 |
| 8 | Oc (# deaths reported: control) | 133 | 133 |
| 9 | MSr (Median Survival: robotic) | 3.8 | 3.8 |
| 10 | MSc (Median Survival: control) | 2.5 | 2.5 |
| | Calculations | | |
| 15 | eq. (X) = HR = MSc/MSr | =D10/D9 | 0.66 |
| 16 | eq. (XI) = Standard Error Ln(HR) = √(1/Or+1/Oc) | =SQRT(1/D7+1/D8) | 0.13 |
| 17 | eq. (XII) = CI Lower = Ln(HR) - 1.96*SE Ln(HR) | =LN(D15)-1.96*D16 | -0.67 |
| 18 | eq. (XII) = CI Upper = Ln(HR) + 1.96*SE Ln(HR) | =LN(D15)+1.96*D16 | -0.16 |
| 20 | Exponentiate CI Lower = e^(Ln(HR) - 1.96*SE Ln(HR)) | =EXP(D17) | 0.51 |
| 21 | Exponentiate CI Upper = e^(Ln(HR) + 1.96*SE Ln(HR)) | =EXP(D18) | 0.85 |

HR=Hazard Ratio, CI=Confidence Interval, eq. = equation

Question #8 I don’t wish to be snobbish, but the 3D exploded pie chart in Figure 4 seems to me to be a very questionable way of presenting these four values (the information to ink ratio is very low here) and could easily be deleted given that this information is presented in the text.

Response to Question #8: The figure and all mention of it have been removed from the manuscript.

Question #9 As a final minor comment, I suggest not using “subjects”, as in “when the goal was to maximize the number of included papers and the number of included subjects” (but note at least four other such uses) and instead using “patients” or (if you feel it is appropriate) “participants”.
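As an aside on the worked example in Tables 3 and 4 above, the spreadsheet steps can also be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' calculator: the function names (`events_from_km`, `hr_from_events`, `hr_from_medians`) are hypothetical, the direction of the difference is inferred from the event rates rather than entered by hand, and the Method 4a event count is rounded half up, an assumption chosen because it matches the 167 quoted for the open group.

```python
from decimal import Decimal, ROUND_HALF_UP
from math import exp, log, sqrt
from statistics import NormalDist

Z_95 = 1.96  # normal quantile used for the 95% CI in Tables 3 and 4


def events_from_km(n_at_risk, survival):
    """Method 4a: estimated events = n at risk - (KM survival x n at risk).

    `survival` is passed as a string (e.g. "0.445") so Decimal keeps it exact.
    """
    n = Decimal(n_at_risk)
    deaths = n - Decimal(survival) * n
    # Assumption: round half up (166.5 -> 167), as in the quoted example.
    return int(deaths.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def hr_from_events(o_r, o_c, n_r, n_c, p_value):
    """Method 2 (eqs. V-VIII): HR and 95% CI from event counts and a p-value."""
    o_total = o_r + o_c
    v_r = o_total * n_r * n_c / (n_r + n_c) ** 2      # eq. (V)
    # Sign of the difference in event rates (row 14 of Table 3).
    direction = 1 if o_r / n_r > o_c / n_c else -1
    z_p = NormalDist().inv_cdf(1 - p_value / 2)       # Phi^-1(1 - p/2)
    o_minus_e = sqrt(o_total * n_r * n_c) / (n_r + n_c) * z_p * direction  # eq. (VII)
    ln_hr = o_minus_e / v_r                           # eq. (VIII)
    se = sqrt(1 / v_r)                                # square root of eq. (VI)
    return exp(ln_hr), exp(ln_hr - Z_95 * se), exp(ln_hr + Z_95 * se)


def hr_from_medians(ms_r, ms_c, o_r, o_c):
    """Method 4b (eqs. X-XII): HR and 95% CI from median survival times."""
    hr = ms_c / ms_r                                  # eq. (X)
    se = sqrt(1 / o_r + 1 / o_c)                      # eq. (XI)
    return hr, exp(log(hr) - Z_95 * se), exp(log(hr) + Z_95 * se)  # eq. (XII)


# Table 3 inputs: 105 vs 133 deaths, 300 patients per group, p = 0.003.
print([round(x, 2) for x in hr_from_events(105, 133, 300, 300, 0.003)])  # [0.68, 0.53, 0.88]
# Table 4 inputs: median survival 3.8 (robotic) vs 2.5 (control).
print([round(x, 2) for x in hr_from_medians(3.8, 2.5, 105, 133)])        # [0.66, 0.51, 0.85]
# Method 4a event estimates from the 3-year KM survival of 58.6% and 44.5%.
print(events_from_km(300, "0.586"), events_from_km(300, "0.445"))        # 124 167
```

The estimated event counts from `events_from_km` can be passed to `hr_from_events` in place of reported counts, which mirrors the description of feeding Method 3 and 4a estimates into the Method 2 calculation.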
Response to Question #9: We have changed all instances of “subjects’ or “subject” to “patients” or “patient” throughout the manuscript. Reviewer #2: Highly appreciated the author’s effort to consolidate methods for calculating hazard ratio, the most appropriate statistic for a meta-analysis of time dependent outcomes, with clear instructions, a comprehensive worked example, practical tips, and additional calculations. Please address the following concerns Query 1: Pages 10 to 12: Use of roman numerals to quote equation number prior to the equation. It may be kept after the equation in the right margin for further use Response to Query #1: The roman numerals associated with the equations have been moved farthest right. Query 2: P14, Line 6 : While describing Method 3, the authors specified that “When the overall p-value seemed to reflect pairwise differences (assumption of similar variances), then the 3-way p-value was assumed to be a reasonable approximation of each pairwise comparison”. This may increase type 1 error rate due to multiple pairwise comparison. How do you address the difference in type I error rate due to each pairwise comparison?? Response to Query #2: We agree and we have added a statement to the manuscript in the Method 3 section as follows: “An additional limitation of this approach is that it may result in type 1 error inflation because the overall comparison is not equivalent to two comparisons to a control.” Query 3: Decision Tree for Hazard Ratio Extraction specified in Figure 1 is remarkable. However the method 4 mentioned by the authors hold the assumption of no censoring; which is difficult to consider in real research studies. How do you justify this? Response to Query #3: Please see response to review 1 question #1 and question #6 regarding Method 4. We agree that an assumption of no censoring is unrealistic and have recommended sensitivity analyses to understand the impact of the use of Method 4. 
This assumption will lead to the fewest problems when the follow-up duration is short and the censoring mechanism is the same in both arms.

Submitted filename: Response to Reviewers.docx

9 Dec 2021

PONE-D-21-18912R1

Methodology to Standardize Heterogeneous Statistical Data Presentations for Combining Time-to-Event Oncologic Outcomes

PLOS ONE Dear Dr. Slee, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 23 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Mona Pathak, PhD Academic Editor PLOS ONE Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. 
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #3: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #3: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #3: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #3: No ********** 5. 
Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #3: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #3: As someone who's worked on extracting this kind of data for systematic reviews, I was happy to see this outline of different ways to extract survival outcomes from publications. It was especially nice to see R code for the method of Guyot et al (in the supplement). The worked examples will also certainly be helpful for many researchers doing this sort of data extraction. Would it be possible to include R and / or SAS code for these examples in the supplement? This would further aid researchers in performing their data extraction without mistakes in simple bits of code. For example, I see that the function NORM.S.INV was used, but I'm not sure without looking it up in both Excel and R help files which is the corresponding R function. Finally, in method 1, it is noted that "For these reasons, we prioritized an adjusted HR using the largest sample that adequately addresses confounding (ie. whole patient population over a matched patient cohort when matching decreased the sample size)." This surprised me, as the whole point of matching is to reduce confounding in studies, even while reducing sample size. 
While the "whole patient population" will of course have a larger sample size, there is no guarantee that an analysis of such data will in fact bring the least biased results, and in fact failing to account for such variables (through matching or adjusting in a regression analysis) would likely lead to biased estimates of the true HR. I would therefore argue, as another reviewer has previously also argued, that priority for HR starts with adjusted or matched analyses, and only if no other data are available, then use the unadjusted HR estimates. The overall goal of a systematic review and possibly included meta-analysis is to obtain a best unbiased estimate of the treatment effect and therefore data extraction for such a SR/MA should also be clearly focused on that. The Cochrane Handbook also appears to prefer adjusted to unadjusted treatment effects, see e.g. https://training.cochrane.org/handbook/current/chapter-06#section-6-3 ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #3: Yes: Sarah R Haile [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. 
To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
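On Reviewer #3's question about NORM.S.INV: Excel's NORM.S.INV(p) returns the inverse of the standard normal cumulative distribution, and base R's qnorm(p), with its default mean = 0 and sd = 1, computes the same quantity. The snippet below is only a minimal sketch of that correspondence. The helper name `z_from_p` is hypothetical (it does not appear in the manuscript or its supplement), and the sketch assumes the reported p-value is two-sided.

```r
# Hypothetical helper (not from the manuscript): z-score implied by a
# two-sided p-value. qnorm() is the base R counterpart of Excel's NORM.S.INV().
z_from_p <- function(p_two_sided) {
  qnorm(1 - p_two_sided / 2)
}

z_from_p(0.05)  # approximately 1.96
```

In a spreadsheet the same value would be obtained with NORM.S.INV(1 - 0.05/2).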

5 Jan 2022

Reviewer #3: As someone who's worked on extracting this kind of data for systematic reviews, I was happy to see this outline of different ways to extract survival outcomes from publications.

Reviewer #3 Comment #1: It was especially nice to see R code for the method of Guyot et al (in the supplement). The worked examples will also certainly be helpful for many researchers doing this sort of data extraction. Would it be possible to include R and / or SAS code for these examples in the supplement? This would further aid researchers in performing their data extraction without mistakes in simple bits of code. For example, I see that the function NORM.S.INV was used, but I'm not sure without looking it up in both excel and R help files which is the corresponding R function.

Response to Reviewer #3 Comment #1: We originally performed the HR estimate calculations in excel only; however, we went back and created code de novo in R for all of the equations needed for Methods 1-4, and we are providing this R code in the supplement as appendix S3 and re-naming the tips and tricks appendix as S4. This new appendix is cited in the “worked example” section of the paper as follows: “We are also providing the R code in Supplemental Appendix S3.” And in the discussion as follows: “Our methods add to the list of previously published techniques for analyzing time-to-event outcomes (4, 7,18), and include details for implementing our strategies (Supplemental Appendices – S1 Guyot code, S3 R code for equations of interest, and S4 Data extraction & tricks).” The addition to the supplement is as follows:

Supplemental Appendix S3: R functions for HR estimate calculations

#Method 1: reverse reference group
RevRef <- function(HR.Ref1, HRCIL.Ref1, HRCIU.Ref1) {
  #Display for HR (CI) as entered
  DISP1<-paste("HR (CI), Ref. Grp. 1: ", round(HR.Ref1,digits=4), " (", round(HRCIL.Ref1,digits=4), ", ", round(HRCIU.Ref1,digits=4), ")" )
  print(DISP1)
  DISP2<-paste("HR (CI), Ref. Grp. 2: ", round(1/HR.Ref1,digits=4), " (", round(1/HRCIU.Ref1,digits=4), ", ", round(1/HRCIL.Ref1,digits=4), ")" )
  print(DISP2)
  rm(HR.Ref1, HRCIL.Ref1, HRCIU.Ref1, DISP1, DISP2)
}
#Example of function call
#Syntax: RevRef(HR with original ref, HR CI lower limit (orig.), HR CI upper limit (orig.))
#Example: RevRef(1.47, 1.14, 1.90)

#####################
#Method 1: Deriving CI from Reported HR and P-value
GetCI <- function(HR, Pval) {
  SELnHR <- log(HR)/qnorm((1-Pval/2), mean=0, sd=1)
  HR.CIL <- exp(log(HR) + SELnHR*1.96)
  HR.CIU <- exp(log(HR) - SELnHR*1.96)
  DISP1 <- paste("HR (CI), p-value: ", round(HR,digits=4), " (", round(HR.CIL,digits=4), ", ", round(HR.CIU,digits=4), "), p = ", round(Pval,digits=4))
  print(DISP1)
  rm(HR, Pval, SELnHR, HR.CIL, HR.CIU, DISP1)
}
#Example of function call
#Syntax: GetCI(Hazard Ratio, Log-Rank p-value)
#Example: GetCI(.68, 0.0032)

#####################
#Method 2: Hazard Ratio as calculated using event counts
M2EvCt <- function(Rr, Rc, Or, Oc, Pval) {
  Dth.r <- Or / Rr
  Dth.c <- Oc / Rc
  Dth.Diff = Dth.r - Dth.c
  Pmult <- ifelse(Dth.Diff < 0, -1, 1)
  inv.Vr <- 1/(((Or + Oc)*Rr*Rc)/((Rr + Rc)^2))
  O_E <- (sqrt((Or + Oc)*Rr*Rc)/(Rr + Rc)*(qnorm((1-Pval/2), mean=0, sd=1))*Pmult)
  LnHR <- O_E*inv.Vr
  HR <- exp(LnHR)
  HR.CIL <- exp (LnHR - 1.96*sqrt(inv.Vr))
  HR.CIU <- exp (LnHR + 1.96*sqrt(inv.Vr))
  DISP1 <- paste("HR (CI), p-value: ", round(HR,digits=4), " (", round(HR.CIL,digits=4), ", ", round(HR.CIU,digits=4), "), p = ", round(Pval,digits=4))
  print(DISP1)
  rm(Rr, Rc, Or, Oc, Pval, Dth.r, Dth.c, Dth.Diff, Pmult, inv.Vr, O_E, LnHR, HR, HR.CIL, HR.CIU, DISP1)
}
#Example of function call
#Syntax: M2EvCt(Total # group 1 patients , Total # group 2 patients, # events group 1, # events group 2, Log-Rank p-value)
#Example: M2EvCt(300, 300, 105, 133, 0.003)

#####################
#Method 4a: Hazard Ratio as calculated using KM estimates
M4aHR.KM <- function(Rr, Rc, Surv.KMr, Surv.KMc, Pval) {
  Or <- (Rr - (Rr*Surv.KMr))
  Oc <- (Rc - (Rc*Surv.KMc))
  Dth.r <- Or / Rr
  Dth.c <- Oc / Rc
  Dth.Diff = Dth.r - Dth.c
  Pmult <- ifelse(Dth.Diff < 0, -1, 1)
  inv.Vr <- 1/(((Or + Oc)*Rr*Rc)/((Rr + Rc)^2))
  O_E <- (sqrt((Or + Oc)*Rr*Rc)/(Rr + Rc)*(qnorm((1-Pval/2), mean=0, sd=1))*Pmult)
  LnHR <- O_E*inv.Vr
  HR <- exp(LnHR)
  HR.CIL <- exp (LnHR - 1.96*sqrt(inv.Vr))
  HR.CIU <- exp (LnHR + 1.96*sqrt(inv.Vr))
  DISP1 <- paste("HR (CI), p-value: ", round(HR,digits=4), " (", round(HR.CIL,digits=4), ", ", round(HR.CIU,digits=4), "), p = ", round(Pval,digits=4))
  print(DISP1)
  rm(Rr, Rc, Or, Oc, Surv.KMr, Surv.KMc, Pval, Dth.r, Dth.c, Dth.Diff, Pmult, inv.Vr, O_E, LnHR, HR, HR.CIL, HR.CIU, DISP1)
}
#Example of function call
#Syntax: M4aHR.KM(Total # group 1 patients , Total # group 2 patients, KM Survival group 1, KM Survival group 2, Log-Rank p-value)
#Example: M4aHR.KM(300, 300, 0.586, 0.445, 0.003)

#####################
#Method 4b. Hazard Ratio calculated using median survival estimates
M4bMedSurv <- function(Or, Oc, MSr, MSc) {
  HR <- MSc/MSr
  SELnHR <- sqrt((1/Or)+(1/Oc))
  LnHR <- log(HR)
  HR.CIL <- exp(LnHR - 1.96*SELnHR)
  HR.CIU <- exp(LnHR + 1.96*SELnHR)
  DISP1 <- paste("HR (CI): ", round(HR,digits=4), " (", round(HR.CIL,digits=4), ", ", round(HR.CIU,digits=4),")")
  print(DISP1)
  rm(Or, Oc, MSr, MSc, SELnHR, LnHR, HR, HR.CIL, HR.CIU, DISP1)
}
#Example of a function call
#Syntax: M4bMedSurv(# events group 1, # events group 2, Median survival for group 1, Median survival for group 2)
#Example: M4bMedSurv(105, 133, 3.8, 2.5)

Reviewer #3 Comment #2: Finally, in method 1, it is noted that "For these reasons, we prioritized an adjusted HR using the largest sample that adequately addresses confounding (ie. whole patient population over a matched patient cohort when matching decreased the sample size)." This surprised me, as the whole point of matching is to reduce confounding in studies, even while reducing sample size. 
While the "whole patient population" will of course have a larger sample size, there is no guarantee that an analysis of such data will in fact bring the least biased results, and in fact failing to account for such variables (through matching or adjusting in a regression analysis) would likely lead to biased estimates of the true HR. I would therefore argue, as another reviewer has previously also argued, that priority for HR starts with adjusted or matched analyses, and only if no other data are available, then use the unadjusted HR estimates. The overall goal of a systematic review and possibly included meta-analysis is to obtain a best unbiased estimate of the treatment effect and therefore data extraction for such a SR/MA should also be clearly focused on that. The Cochrane Handbook also appears to prefer adjusted to unadjusted treatment effects, see e.g. https://training.cochrane.org/handbook/current/chapter-06#section-6-3 Response to Reviewer #3 Comment #2: We agree that matched or adjusted HR should be used preferentially over an unadjusted HR. We apologize for the lack of clarity in our statement. Our comment was meant to address the case when a paper reported both an adequately adjusted HR and a matched HR for the same data set. To clarify, we have modified the statement in the “Method 1” section as follows: "When multiple hazard ratios were reported, the statistical analysis that produced the hazard ratio was also captured (i.e., univariable, multivariable) to follow the extraction priority. It is important to determine an extraction preference a priori for when more than one hazard ratio is reported. Our criterion was to prioritize adjusted or matched analyses over unadjusted data, and when both adjusted and matched analyses were available, to maximize group size, because analyses using entire populations account for the relative frequency of case types, severity of disease, surgeon experience, etc. and the results are more generalizable. 
For these reasons, we prioritized an adjusted HR using the largest sample that adequately addresses confounding (ie. adjusted analysis using the whole patient population over a matched patient cohort when matching decreased the sample size)." We also clarified that an adjusted or matched analysis is preferable to an unmatched analysis by modifying the Quality Control Methods section of the manuscript using wording similar to that of the reviewer, as follows: “When more than one analysis was reported, we selected an adjusted or matched analysis preferentially, and only if no other data were available, unadjusted data; we chose the largest available analysis that adequately addressed cohort differences using the same hierarchy as listed above. When no adjustment or matching was performed, the comparability of the groups was determined by comparing baseline values for a list of covariates potentially related to oncologic outcomes.” We also clarified this priority in supplemental appendix S4: Assumptions, Rules, and Tips as follows: “Our Rule: Use an adjusted or matched HR over an unmatched HR. In instances where both an adjusted and a matched HR are provided, to maximize group size, use an adjusted HR using the whole patient population over a matched patient cohort when matching decreased the sample size.” Journal Requirements - References: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. 
Response to Journal Requirements: All references have been checked for completeness, and citations have been modified to add details as shown in the tracked changes version of the manuscript. There are no retracted references; however, the Parmar 1998 reference had an Erratum/correction published that corrected one of the equations that was not relevant for our work, and those details have been added to the end of the reference as follows: “Parmar MKB, Torri V, Stewart L. Extracting summary statistics to perform meta‐analyses of the published literature for survival endpoints. Statistics in medicine. 1998;17(24):2815-34. Corrected: Stat Med. 2004 Jun 15;23(11):1817.” We have confirmed that all references can be found by searching Pubmed or google using the citations as currently listed. There was a glitch in our reference handling software; therefore, the following references have been corrected as follows: Reference #2 has been corrected to: “2. Tewari A, Sooriakumaran P, Bloch DA, Seshadri-Kreaden U, Hebert AE, Wiklund P. Positive surgical margin and perioperative complication rates of primary surgical treatments for prostate cancer: a systematic review and meta-analysis comparing retropubic, laparoscopic, and robotic prostatectomy. Eur Urol. 2012;62(1):1-15.” Reference #5 has been corrected to: “5. Williamson PR, Smith CT, Hutton JL, Marson AG. Aggregate data meta-analysis with time-to-event outcomes. Stat Med 2002;21:3337-51.” The order of Ref #7 Parmar 1998 and Ref #8 Higgins Chapter 6 has been swapped. Reference #12 for the Cochrane Handbook has been updated to the latest online version: “Ref #12 Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, et al. Cochrane handbook for systematic reviews of interventions. 6.2 ed (updated February 2021): Cochrane, 2021. Available from www.training.cochrane.org/handbook.” Ref #18 has been removed and references 19-24 have been re-numbered to references 18-23. 
Guidelines for resubmitting figure files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Response to Guidelines for resubmitting figure files: The three figures to be included in the main body of the manuscript have been run through the PACE tool. The PACE tool changed the resolution to 300 ppi, converted them to a valid TIF file, and changed the names to meet requirements.

Submitted filename: Response to Reviewers.docx

25 Jan 2022

Methodology to Standardize Heterogeneous Statistical Data Presentations for Combining Time-to-Event Oncologic Outcomes

PONE-D-21-18912R2

Dear Dr. Slee, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication after incorporation of suggestions by reviewer and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. 
If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Mona Pathak, PhD Academic Editor PLOS ONE Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #3: All comments have been addressed ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #3: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #3: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). 
The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #3: No ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #3: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #3: Thank you for addressing my comments, and for adding the code. It would be more readable for readers if the code did not round all of the numbers, and instead printed them out as usual. I would also point out that it is not necessary to rm() objects at the end of a function, as R does not keep things created during a function call. See some example code I adapted for RevRef and GetCI below, though the other functions should be adapted with readability in mind, that is so that readers see what the function is doing without worrying about how to round or print the numbers. 
RevRef <- function(HR.Ref1, HRCIL.Ref1, HRCIU.Ref1) {
  Hazard.Ratio <- c(HR.Ref1, 1/ HR.Ref1)
  lb <- c(HRCIL.Ref1, 1 / 1/HRCIU.Ref1)
  ub <- c(HRCIU.Ref1, 1 / 1/HRCIL.Ref1)
  data.frame(reference = c("Group 1", "Group 2"), Hazard.Ratio, lb, ub)
}
RevRef(1.47, 1.14, 1.90)

#Method 1: Deriving CI from Reported HR and P-value
GetCI <- function(HR, Pval) {
  SELnHR <- log(HR)/qnorm((1-Pval/2), mean=0, sd=1)
  HR.CIL <- exp(log(HR) + SELnHR*1.96)
  HR.CIU <- exp(log(HR) - SELnHR*1.96)
  c("HR" = HR, "lb" = HR.CIL, "ub" = HR.CIU, p.value = Pval)
}
GetCI(.68, 0.0032)

********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #3: Yes: Sarah R Haile

10 Feb 2022

PONE-D-21-18912R2

Methodology to Standardize Heterogeneous Statistical Data Presentations for Combining Time-to-Event Oncologic Outcomes

Dear Dr. Slee: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Mona Pathak Academic Editor PLOS ONE
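
A note on the GetCI function quoted in this review history: its standard error term, SELnHR, inherits the sign of log(HR), so adding SELnHR*1.96 gives the lower limit only when HR < 1 (as in the example GetCI(.68, 0.0032)); for HR > 1 the two limits would come out swapped. The sketch below is one possible sign-robust variant, returning unrounded values in the style Reviewer #3 suggested. The name `get_ci_abs` and the `level` argument are hypothetical and are not part of Supplemental Appendix S3; the sketch assumes Pval is a two-sided p-value for a test of HR = 1 and that Pval < 1.

```r
# Hypothetical variant of GetCI (not part of Supplemental Appendix S3).
# Assumes Pval is a two-sided p-value (< 1) for the test of HR = 1.
get_ci_abs <- function(HR, Pval, level = 0.95) {
  z_obs  <- qnorm(1 - Pval / 2)          # z-score implied by the p-value
  se     <- abs(log(HR)) / z_obs         # SE of ln(HR), always non-negative
  z_crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(HR = HR,
    lb = exp(log(HR) - z_crit * se),
    ub = exp(log(HR) + z_crit * se),
    p.value = Pval)
}

get_ci_abs(0.68, 0.0032)
get_ci_abs(1 / 0.68, 0.0032)  # reversed reference group: reciprocal limits, lb < ub
```

Taking the absolute value keeps the lower limit below the upper limit in both directions, so the output agrees with the reference-group reversal performed by RevRef.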

18 in total

1. Aggregate data meta-analysis with time-to-event outcomes.

Authors: Paula R Williamson; Catrin Tudur Smith; Jane L Hutton; Anthony G Marson
Journal: Stat Med Date: 2002-11-30 Impact factor: 2.373

2. Interim analysis on survival data: its potential bias and how to repair it.

Authors: Hans C van Houwelingen; Cornelis J H van de Velde; Theo Stijnen
Journal: Stat Med Date: 2005-09-30 Impact factor: 2.373

3. How to obtain the confidence interval from a P value.

Authors: Douglas G Altman; J Martin Bland
Journal: BMJ Date: 2011-08-08

4. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt.

Authors: Patrick Royston; Mahesh K B Parmar
Journal: Stat Med Date: 2011-05-25 Impact factor: 2.373

5. A systematic review and meta-analysis of robotic versus open and video-assisted thoracoscopic surgery approaches for lobectomy.

Authors: Katie E O'Sullivan; Usha S Kreaden; April E Hebert; Donna Eaton; Karen C Redmond
Journal: Interact Cardiovasc Thorac Surg Date: 2019-04-01

6. Updated guidance for trusted systematic reviews: a new edition of the Cochrane Handbook for Systematic Reviews of Interventions.

Authors: Miranda Cumpston; Tianjing Li; Matthew J Page; Jacqueline Chandler; Vivian A Welch; Julian Pt Higgins; James Thomas
Journal: Cochrane Database Syst Rev Date: 2019-10-03

7. A guide on meta-analysis of time-to-event outcomes using aggregate data in vascular and endovascular surgery.

Authors: George A Antoniou; Stavros A Antoniou; Catrin Tudur Smith
Journal: J Vasc Surg Date: 2020-03 Impact factor: 4.268

8. Comparison of aggregate and individual participant data approaches to meta-analysis of randomised trials: An observational study.

Authors: Jayne F Tierney; David J Fisher; Sarah Burdett; Lesley A Stewart; Mahesh K B Parmar
Journal: PLoS Med Date: 2020-01-31 Impact factor: 11.069

9. The Magic of Randomization versus the Myth of Real-World Evidence.

Authors: Rory Collins; Louise Bowman; Martin Landray; Richard Peto
Journal: N Engl J Med Date: 2020-02-13 Impact factor: 91.245

10. Practical methods for incorporating summary time-to-event data into meta-analysis.

Authors: Jayne F Tierney; Lesley A Stewart; Davina Ghersi; Sarah Burdett; Matthew R Sydes
Journal: Trials Date: 2007-06-07 Impact factor: 2.279