The purpose of the present study was to perform the first examination of the utility of p values and the degree of statistical fragility in the hip arthroscopy literature by applying both the Fragility Index (FI) and the Fragility Quotient (FQ) to dichotomous comparative trials. We hypothesized that dichotomous comparative trials evaluating categorical outcomes in the hip arthroscopy literature are statistically fragile. METHODS: The PubMed and MEDLINE databases were queried from 2008-2018 for comparative studies evaluating dichotomous data in the hip arthroscopy literature. The present analysis included both randomized controlled trials (RCTs) and non-RCTs in which dichotomous data and associated p values were reported. Fragility analysis was performed with use of the Fisher exact test until an alteration of significance was determined. RESULTS: Of the 5,836 studies screened, 4,156 met the search criteria, with 52 comparative studies included for analysis. One hundred and fifty total outcome events with 33 significant (p < 0.05) outcomes and 117 nonsignificant (p ≥ 0.05) outcomes were identified. The final FI incorporating all 150 outcome events from 52 comparative studies was only 3.5 (interquartile range, 2 to 6), with an associated FQ of 0.032 (interquartile range, 0.017 to 0.063). Twenty-two studies (42.3%) either failed to report loss to follow-up (LTF) data or reported LTF greater than the overall FI of 3.5. CONCLUSIONS: The peer-reviewed hip arthroscopy literature may not be as stable as previously thought, as the sole reliance on a threshold p value has proven misleading. We therefore recommend reporting of the FI and FQ, in conjunction with p values, to aid in the evaluation and interpretation of statistical robustness and quantitative significance in future comparative hip arthroscopy studies.
Hip arthroscopy was developed in 1931 by Burman for joint visualization and began gaining clinical applicability in the 1970s to 1980s as surgeons introduced it as a technique for diagnosis, management, and surgical intervention for several hip ailments[1-3]. This procedure has been recently popularized as a joint-preserving procedure for the management of femoroacetabular impingement (FAI), a term that was first coined by Ganz et al. in 2003[4]. Since its introduction, FAI has been rapidly identified as a major cause of hip pain and arthritis[5]. Subsequently, hip arthroscopy has become increasingly utilized, given its favorable outcomes[6,7]. Evidence-based medicine (EBM) has driven treatment protocols and surgical methods for all fields of medicine, including hip arthroscopy. EBM was first introduced by Cochrane in 1972[8] and was later expanded by Eddy under the principle of utilizing information on health and economic outcomes to guide clinical decision-making[9]. The adoption of EBM has become central to modern medicine but is challenging given the ever-expanding body of published literature. Furthermore, instances of poor data and statistical integrity in orthopaedic research may compromise EBM guidance if they are not properly recognized[10]. This is especially true for relatively new techniques such as hip arthroscopy, for which the availability of high-powered studies may be limited.
In the hip arthroscopy literature, dichotomous comparative trials produce the best available evidence to guide clinical decisions, with significance being determined by probability assessment, resulting in either the rejection of, or the failure to reject, the null hypothesis. This method conventionally uses an a priori p value threshold of 0.05, meaning that a difference at least as large as the one observed would be expected less than 5% of the time if the null hypothesis were true.
Despite its ubiquity, the p value has been met with criticism because of instances in which it may be overvalued without regard for factors such as sample size, loss to follow-up (LTF), or lack of sufficient power[11-13]. In these cases, a limited number of event reversals can change study significance. The fragility index (FI), first proposed by Feinstein in 1990 as the “unit fragility,” was developed to address the shortcomings of the p value and is expressed as the number of event reversals required to change study significance[14]. A low FI means that only a few event reversals are required to reverse study significance, thus suggesting statistically fragile results. With the FI having been retrospectively applied to the literature, an alarming prevalence of statistical fragility has been identified across several disciplines and subspecialties[15-30]. Applying the FI in addition to p value analysis provides a much clearer picture of the stability of outcomes. However, the FI is an absolute measure and therefore does not account for cohort size. For this reason, Ahmed et al. proposed the Fragility Quotient (FQ) as a relative measure of fragility, calculated by dividing the FI by the total sample size[31]. Supplementing the p value with the FI and FQ provides a more comprehensive understanding of study stability by accounting for sample size. These stability metrics can aid readers in their critical evaluation of the literature while guiding clinical decision-making through evidence-based principles.
The purpose of the present study was to perform the first examination of the utility of p values and the degree of statistical fragility in the hip arthroscopy literature by applying both the FI and FQ to dichotomous comparative trials. We hypothesized that dichotomous comparative trials evaluating categorical outcomes in the hip arthroscopy literature are statistically fragile.
Materials and Methods
The PubMed and MEDLINE databases were queried from 2008 to 2018 for comparative studies reporting dichotomous data in the hip arthroscopy literature, with utilization of the following search terms: “hip arthroscopy” OR “cam” OR “pincer” OR “labrum” OR “femoroacetabular impingement” OR “FAI” OR (“hip” AND “arthroscopy”) OR (“hip” AND “dysplasia”) OR (“hip” AND “cam”) OR (“hip” AND “pincer”) OR (“hip” AND “labrum”) OR (“hip” AND “femoroacetabular impingement”) OR (“hip” AND “FAI”). Journals included for analysis were The Journal of Bone and Joint Surgery (JBJS); The American Journal of Sports Medicine (AJSM); Arthroscopy: The Journal of Arthroscopic and Related Surgery (Arthroscopy); Knee Surgery, Sports Traumatology, Arthroscopy (KSSTA); and the Journal of Hip Preservation Surgery (JHPS). These 5 peer-reviewed journals were selected because of their prominence in the published hip arthroscopy literature. Thus, analysis of 11 years of data within these 5 journals provides a representative sample of peer-reviewed research in hip arthroscopy. The analysis included both randomized controlled trials (RCTs) and non-RCTs in which dichotomous data and associated p values were reported. Studies involving cadaveric, animal, in vitro, and non-dichotomous data, along with systematic reviews, were excluded from analysis. Outcome measures were reported as primary, secondary, or not specified, as specifically stated in each study that was included in the analysis. Outcomes that were reported as significant (p < 0.05) or not significant (p ≥ 0.05) were recorded and analyzed. LTF data were determined and documented. Fragility analysis was performed with use of the Fisher exact test until an alteration of significance was determined (Table I)[20]. For example, if an outcome was reported as significant, the number of events required to alter the p value to not significant was determined. 
Similarly, the number of events required to change an outcome from not significant to significant was determined. The resultant numerical value indicates the number of event reversals required to change the significance of an outcome and was recorded as the FI for that outcome. Additionally, all event reversals were determined and pooled, with the median value representing the FI for the entire study. The FQ was also calculated for each outcome by dividing the FI by the sample size. The total FQ for all outcomes as well as the FQ for RCTs and non-RCTs was determined. Interquartile ranges (IQRs) were determined to aid in the interpretation of the reported variability and dispersion of the data.
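The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual implementation: the function names are hypothetical, the two-sided Fisher exact test is computed directly from the hypergeometric distribution, and the rule for choosing which patient outcome to reverse (working in the arm with the lower event proportion) is one common convention that the text does not specify. The example counts are those of Table I.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(k):
        # Unnormalized hypergeometric probability of k events in the first row.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / total


def fragility_index(a, b, c, d, alpha=0.05):
    """Count single-patient outcome reversals until significance flips.

    Assumption: reversals are made in the arm with the lower event
    proportion, narrowing the gap for a significant result and widening
    it for a nonsignificant one. Returns None if no flip is reachable.
    """
    # Put the arm with the lower event proportion first.
    if a * (c + d) > c * (a + b):
        a, b, c, d = c, d, a, b
    significant = fisher_exact_two_sided(a, b, c, d) < alpha
    flips = 0
    while (fisher_exact_two_sided(a, b, c, d) < alpha) == significant:
        if significant:
            if b == 0:
                return None
            a, b = a + 1, b - 1   # non-event becomes event: narrows the gap
        elif a > 0:
            a, b = a - 1, b + 1   # event becomes non-event: widens the gap
        elif d > 0:
            c, d = c + 1, d - 1   # widen the gap from the other arm instead
        else:
            return None
        flips += 1
    return flips


def fragility_quotient(fi, sample_size):
    """FQ = FI divided by the total sample size."""
    return fi / sample_size


# Table I, Scenario 1: Treatment A = 1 vs 23, Treatment B = 6 vs 14.
p_scenario_1 = fisher_exact_two_sided(1, 23, 6, 14)   # rounds to 0.04
p_scenario_2 = fisher_exact_two_sided(2, 22, 6, 14)   # rounds to 0.11
fi = fragility_index(1, 23, 6, 14)                    # 1
fq = fragility_quotient(fi, 1 + 23 + 6 + 14)          # 1 / 44
print(round(p_scenario_1, 2), round(p_scenario_2, 2), fi, round(fq, 3))
```

With the Table I counts, a single reversal moves the p value across the 0.05 threshold, giving an FI of 1 and an FQ of 1/44 (about 0.023).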
Source of Funding
No external funding was acquired in support of this research.
Results
Of the 5,836 studies screened, 4,156 met the search criteria, with 52 comparative studies included in the analysis (Fig. 1). One hundred and fifty total outcome events, comprising 33 significant (p < 0.05) outcomes and 117 nonsignificant (p ≥ 0.05) outcomes, were identified. For the 33 outcomes that were reported as significant, the median number of events required to change significance was only 4 (IQR, 1 to 9) (Table II). The FQ for significant outcomes was 0.025 (IQR, 0.010 to 0.082). For the 117 outcomes that were reported as nonsignificant, the median number of events required to change significance was 3 (IQR, 2 to 5). The FQ for nonsignificant outcomes was 0.032 (IQR, 0.017 to 0.060). Therefore, there was no difference in statistical fragility between outcome events reported as significant as compared with those reported as nonsignificant. Of the 150 total outcomes, 34 (22.7%) were primary, 30 (20%) were secondary, and 86 (57.3%) were not specified. No difference was appreciated between primary, secondary, and not-specified outcomes, with an FI of 3 (IQR, 2 to 5), 3 (IQR, 2 to 6), and 4 (IQR, 2 to 5), respectively. The associated FQ was nearly identical, with values of 0.028, 0.047, and 0.032, respectively. Further subanalysis by journal did not demonstrate a correlation between study fragility and impact factor (IF), with the least-fragile findings realized in JBJS (IF, 4.578), with an FI of 9 and an associated FQ of 0.071 (Table III). This was followed by Arthroscopy (IF, 4.325), with an FI of 5 and an FQ of 0.068. AJSM (IF, 5.810), JHPS (IF, 1.917), and KSSTA (IF, 3.210) all demonstrated similar fragility, with an FI of 3 and an FQ of 0.032, 0.028, and 0.004, respectively. A subanalysis of comparative trial types identified a difference between the 141 non-RCT outcomes (FI, 3; IQR, 2 to 5) and the 9 RCT outcomes (FI, 6; IQR, 4.5 to 7). The final FI, incorporating all 150 outcome events from 52 comparative studies, was only 3.5 (IQR, 2 to 6).
The final FQ was 0.032 (IQR, 0.017 to 0.063), indicating that the reversal of only 3.2 of 100 outcomes may change the study significance of the included RCTs and non-RCTs. Of the 52 included studies, 19 (36.5%) failed to report LTF data. Three studies (5.8%) reported LTF data greater than the overall FI of 3.5. Therefore, 42.3% of studies either failed to report LTF data or reported an LTF value that was greater than the overall FI. Subgroup analysis showed that 14.3% of RCTs and 40% of non-RCTs failed to report LTF data.
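The comparison of LTF against the FI reduces to simple arithmetic, sketched below using only the counts reported above. The helper name `ltf_undermines_result` is hypothetical, and representing unreported LTF as `None` is an assumption made for illustration.

```python
def ltf_undermines_result(ltf, fragility_index):
    """True when LTF is unreported (None) or exceeds the fragility index."""
    return ltf is None or ltf > fragility_index


# Counts taken from the Results.
total_studies = 52
unreported_ltf = 19      # studies that did not report LTF
ltf_above_fi = 3         # studies with LTF greater than the overall FI of 3.5

flagged = unreported_ltf + ltf_above_fi
share_unreported = round(100 * unreported_ltf / total_studies, 1)   # 36.5
share_above_fi = round(100 * ltf_above_fi / total_studies, 1)       # 5.8
share_flagged = round(100 * flagged / total_studies, 1)             # 42.3

# An FQ of 0.032 corresponds to 3.2 reversals per 100 patients.
reversals_per_100 = round(0.032 * 100, 1)                           # 3.2

print(flagged, share_unreported, share_above_fi, share_flagged, reversals_per_100)
```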
Fig. 1
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) study identification flowchart.
Table II: Fragility Data Based on Trial and Outcome Characteristics. Table III: Fragility Data Based on Journal Impact Factor.
Discussion
In the present comprehensive evaluation of 52 hip arthroscopy comparative trials and 150 outcome events across 5 leading peer-reviewed orthopaedic journals, we demonstrated substantial fragility, with an overall median FI of only 3.5 and associated median FQ of just 0.032. An FI of 3.5 indicates that reversal of the outcome for just 4 patients would be sufficient to reverse significance. Accounting for sample size, an FQ of just 0.032 indicates a low level of trial stability, as only 3.2 of 100 patients is the median number required to reverse significance across all 150 outcome events. Furthermore, 22 (42.3%) of the 52 studies failed to provide LTF data or presented an LTF value that was greater than the overall FI. This suggests that reversal of significance might have been realized by simply maintaining follow-up of all patients in the study. In combination with a low median FI and FQ, these compelling data demonstrate that the hip arthroscopy literature may be more fragile than previously recognized.
A primary purpose of conducting evidence-based research is to improve our collective knowledge base and the quality of clinical care delivery. Information gained with regard to particular treatment strategies and patient outcomes allows physicians to enter the shared decision-making process armed with objective data. Given that such data are heavily relied on for appropriate clinical management, it is crucial for the strength of significant findings to be easily accessible and understood. The current and standard method with which to report significance is p value analysis. A p value of <0.05 is conventionally taken to confer significance, meaning that a result at least this extreme would be expected less than 5% of the time if the null hypothesis were true. In that scenario, one would reject the null hypothesis. However, if the statistical finding is fragile in nature, this may lead to an unintentional type-I (alpha) error.
The converse also holds true in that failure to reject the null hypothesis in the setting of a fragile statistical finding may lead to a type-II (beta) error. Thus, the p value should not be utilized as a sole measure of effect. Rather, it should be utilized to aid in the interpretation of evidence, taking into consideration study design and methodological integrity. It is therefore necessary to provide an accurate assessment of a study’s statistical fragility in the published literature. As such, inclusion of both the FI and the FQ in the analysis of fragility of comparative trials provides clinicians with a more accurate and comprehensive understanding of trial significance. Although a direct comparison of RCTs and non-RCTs may not be appropriate given the differing integrity of study design, with non-RCTs being vulnerable to both selection bias and confounding, we identified a difference in the fragility of RCTs as compared with non-RCTs. The FI for RCTs was found to be 6, whereas that for non-RCTs was only 3. In other words, the reversal of only 3 events in non-RCTs is sufficient to provide a reversal of significance, compared with 6 events in RCTs. Additionally, RCTs exhibited an FQ of 0.098, whereas non-RCTs demonstrated an FQ of 0.028. These findings are consistent with previously published studies in the orthopaedic literature evaluating significance and fragility[22-30]. Khormaee et al. evaluated the fragility of dichotomous outcomes in 17 RCTs in the pediatric orthopaedic literature and identified a median FI of just 3[25]. Evaniew et al., in an evaluation of 40 RCTs in the spine surgery literature, reported a median FI of only 2, with 75% of the trials demonstrating an FI of ≤3[22]. Parisien et al. further investigated the statistical stability of 102 comparative RCTs and non-RCTs in the sports medicine literature and identified an FI of only 5 across 339 outcome events[26].
Khan et al., in a study of 48 primary outcomes in RCTs in the sports medicine literature, reported an even more fragile FI of only 2[24]. Furthermore, Parisien et al., in an evaluation of 775 outcome events across 80 RCTs and 118 non-RCTs in the orthopaedic trauma literature, identified an FI and FQ of just 5 and 0.046, respectively[27]. Forrester et al., in an examination of 23 studies with 48 outcome events in the orthopaedic oncology literature, identified an overall median FI of 4, with a median FI for significant outcomes of only 2[23]. Parisien et al., in 2 recent fragility analyses of RCTs in the cartilage restoration and rotator cuff literature, identified an overall FI of 4 and 4, respectively, as well as an FQ of 0.067 and 0.092, respectively[29,30]. The FI values for those orthopaedic studies align closely with that of our current evaluation of the hip arthroscopy literature. A number of studies evaluated additional statistical fragility correlates. Three fragility studies[22,24,25] found that an increasing FI correlated significantly with smaller (more significant) p values, and 3 studies[22,23,25] reported a positive correlation between FI and sample size. Interestingly, several fragility studies reported on outcomes resulting in an FI of 0, meaning the reversal of significance was determined by simply re-calculating the p value with an alternative statistical test. Khormaee et al. identified 3 articles (17.6%) with an FI of 0[25]. Similarly, Evaniew et al. reported that 8 (20%) of the 40 outcomes that they assessed resulted in an FI of 0 following their own p value analysis[22]. Khan et al., in a study of the sports medicine literature, reported that an FI of 0 was identified for 8 outcomes (16.6%), leading the authors to report that “outcomes became nonsignificant when we recalculated the p value using the 2-sided Fisher exact test.”[24] Several studies further evaluated the effect of the number of patients with LTF on the resulting significance.
Khormaee et al., in an evaluation of 17 pediatric RCTs, reported that only 2 studies actually included LTF data, with 1 study revealing that the number of patients LTF was greater than the resultant FI[25]. This finding would suggest the potential reversal of study significance by simply maintaining follow-up. Similarly, Evaniew et al., in a comprehensive evaluation of the spine literature, found that the FI was less than or equal to the LTF value for 26 outcomes (65%)[22]. This pattern persisted in the sports medicine literature, with Khan et al.[24] identifying 23 outcomes (48%) with an LTF value that was greater than or equal to the FI. Additionally, in an evaluation of the sports medicine literature, Parisien et al.[26] reported that the average LTF value (7.9) was greater than the overall FI of 5. Furthermore, in an evaluation of the orthopaedic oncology literature, Forrester et al. found that 60% of the outcomes had an FI value that was less than or equal to the LTF value[23]. Parisien et al., in a recent fragility analysis evaluating the cartilage restoration literature, found that 15.8% of studies either did not report LTF data or reported an LTF value that was greater than the FI[29]. Additionally, in a systematic review and meta-analysis of RCTs evaluating the use of platelet-rich plasma in rotator cuff surgery, Parisien et al. revealed that, of the studies reporting LTF data, 30.2% reported an LTF value that was greater than the FI[30].
The present study is the first to provide a detailed analysis of significance in the hip arthroscopy literature. Our findings further demonstrate the lack of correlation between journal impact factor and degree of fragility, thus emphasizing the importance of including measures such as the FI and FQ to provide additional context to reported p values. Additionally, the present study includes an analysis of both primary and secondary outcomes for a more comprehensive and accurate FI and FQ analysis.
Conclusions
The peer-reviewed hip arthroscopy literature may not be as stable as previously thought, as the utilization of a threshold p value has proven misleading. We therefore recommend reporting of the FI, FQ, and p value to aid in the evaluation and interpretation of statistical robustness and quantitative significance in future comparative hip arthroscopy studies.
TABLE I
Demonstration of Reversal of Significance with a Fragility Index of 1

Scenario      Treatment      Outcome A    Outcome B    P Value
Scenario 1    Treatment A    1            23           0.04
              Treatment B    6            14
Scenario 2    Treatment A    2            22           0.11
              Treatment B    6            14
TABLE II
Fragility Data Based on Trial and Outcome Characteristics