Georg Zimmermann1,2,3, Lisa-Maria Bolter3, Ronny Sluka4, Yvonne Höller1, Arne C Bathke3, Aljoscha Thomschewski1,2,4, Stefan Leis1, Simona Lattanzi5, Francesco Brigo6,7, Eugen Trinka1,8. 1. Department of Neurology, Christian Doppler Medical Centre, Paracelsus Medical University, Salzburg, Austria. 2. Spinal Cord Injury and Tissue Regeneration Centre Salzburg, Paracelsus Medical University, Salzburg, Austria. 3. Department of Mathematics, Paris Lodron University, Salzburg, Austria. 4. Department of Psychology, Paris Lodron University, Salzburg, Austria. 5. Neurological Clinic, Department of Experimental and Clinical Medicine, Marche Polytechnic University, Ancona, Italy. 6. Department of Neurosciences, Biomedicine and Movement Sciences, University of Verona, Verona, Italy. 7. Division of Neurology, Franz Tappeiner Hospital, Merano, Italy. 8. Department of Public Health, Health Services Research and Health Technology Assessment, UMIT, Hall i. T., Austria.
Abstract
AIM: Prevalence and incidence of spinal cord injury (SCI) are low. However, sample sizes have not been systematically examined yet, although this might represent useful information for study planning and power considerations. Therefore, our objective was to determine the median sample size in clinical trials on SCI individuals. Moreover, within small-sample size studies, statistical methods and awareness of potential problems regarding small samples were examined. METHODS: We systematically reviewed all studies on human SCI individuals published between 2014 and 2015, where the effect of an intervention on one or more health-related outcomes was assessed by means of a hypothesis test. If at least one group had a size <20, the study was classified as a small sample size study. PubMed was searched for eligible studies; subsequently, data on sample sizes and statistical methods were extracted and summarized descriptively. RESULTS: Out of 8897 studies, 207 were included. Median total sample size was 18 (range 4-582). Small sample sizes were found in 167/207 (81%) studies; resulting limitations and implications for statistical analyses were mentioned in 109/167 (65%) studies. CONCLUSIONS: Although most recent SCI trials have been conducted with small samples, the consequences on statistical analysis methods and the validity of the results are rarely acknowledged.
Individuals with spinal cord injury (SCI) face various impairments, ranging from deficits in motor function to further complications, which may be even more serious and life‐threatening, such as urinary tract infections and bladder dysfunction,1, 2 or cardiovascular and respiratory diseases.3, 4 Recent studies indicate that the prevalence of SCI is about 25/100 000,5 although estimates vary from 11 to 112/100 000 due to methodological and regional differences.6, 7 Annual incidences range from about 8 to 80/1 000 000.7 In order to improve outcomes in SCI and the persons’ quality of life, numerous therapeutic approaches have been investigated. The scope ranges from pharmacological therapies8 to interventions targeting brain reorganization, neuroplasticity induced by stimulation techniques (eg, transcranial magnetic stimulation9) as well as rehabilitative and training programs for SCI individuals, either with or without the use of assistive devices.10, 11 These efforts have been complemented by systematic reviews of the scientific status quo,12 with the ultimate goal of setting up evidence‐based guidelines for treatment and rehabilitation of people with SCI.13, 14 However, from a methodological point of view, the question whether the relatively small prevalence of SCI and well‐known difficulties regarding recruitment and compliance15 translate to small sample sizes warrants a systematic investigation.
Ioannidis considered the probability that a research finding is indeed true,16 showing that this probability is positively associated with the power of a study (ie, the probability that the treatment effect can be detected statistically, given that there is one) as well as the ratio of true to false hypotheses in a particular field, but negatively associated with the type I error level (eg, falsely claiming that a new therapy is superior to the standard one) and the amount of bias. It should be noted that the power of a study is in turn influenced by several factors: one‐sided tests are more powerful than their two‐sided counterparts, power increases with growing effect size and larger significance level, and the spread of the data is negatively associated with power. In particular, the latter fact explains why a paired test has higher power than its unpaired counterpart, given that all other specifications are the same. Furthermore, it is well known in mathematical statistics that some frequently used statistical methods may either over‐ or underreject the respective hypotheses in small sample situations.17, 18, 19 For example, power differences between the unpaired t‐test and the Wilcoxon‐Mann‐Whitney test depend on whether or not the assumptions underlying the t‐test are met. Moreover, power differences may result from the fact that the respective hypotheses being tested are not the same. Summing up, the choice of the statistical analysis method may lead to a further deflation of the probability that a research finding is true. Apart from that, especially small sample sizes might lead to a decrease in power20 and, thus, also to a decreased post‐study probability.
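The relationship described above can be made concrete with a short numerical sketch. It uses the positive predictive value formula as given in the cited work by Ioannidis (reference 16), PPV = ((1 − β)R + uβR) / (R + α − βR + u − uα + uβR), where 1 − β is the power, α the type I error level, R the pre‐study odds of a true relationship, and u the proportion of analyses affected by bias. The function name and the values chosen for R and u below are illustrative assumptions and are not taken from the present review.

```python
def post_study_probability(power, alpha=0.05, prior_odds=1.0, bias=0.0):
    """Probability that a claimed finding is true (Ioannidis-style PPV).

    power      : 1 - beta, probability of detecting a true effect
    alpha      : type I error level
    prior_odds : R, ratio of true to false hypotheses in the field (assumed)
    bias       : u, proportion of analyses affected by bias (assumed)
    """
    beta = 1.0 - power
    r, u = prior_odds, bias
    numerator = (1.0 - beta) * r + u * beta * r
    denominator = r + alpha - beta * r + u - u * alpha + u * beta * r
    return numerator / denominator


# Illustrative comparison: same field (hypothetical R = 0.25, no bias),
# but decreasing power, as one would expect with shrinking sample sizes.
for power in (0.80, 0.56, 0.29):
    ppv = post_study_probability(power, alpha=0.05, prior_odds=0.25)
    print(f"power = {power:.2f} -> post-study probability = {ppv:.2f}")
```

Under these assumed inputs, lower power translates directly into a lower post‐study probability, and increasing the bias term lowers it further.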
Moreover, small sample sizes are also related to other methodological aspects: for example, there is evidence that an adequate sample size is required for avoiding covariate imbalance, which in turn could affect the validity of the results.21 It would be inappropriate, however, to judge the quality of SCI studies based on the examination of sample sizes and statistical methods only. Rather, the considerations from above demonstrate what could potentially happen, given that sample sizes are small.
The aim of the present investigation is twofold: first, we would like to determine the median sample size in clinical trials on human individuals with SCI. Although we may speculate that the sample sizes are rather small, the principle of evidence‐based, systematic research warrants a thorough investigation.22 Moreover, as a secondary analysis, we consider the subset of studies with small sample sizes (the definition is stated in the Methods section) and examine some key characteristics of the statistical methods that were used as well as the extent to which the authors mentioned potential problems related to small sample sizes.
METHODS
Study selection and data extraction
We chose a sample covering the two consecutive years that were most recent at the point we started the evaluation. In order to minimize the risk of bias due to delayed publication and post hoc changes in electronic databases, we considered all research papers published between 01 January 2014 and 31 December 2015, and extracted the full list of abstracts from PubMed on 23 January 2017. We selected articles that reported results from interventional studies on individuals with traumatic or nontraumatic SCI (eg, ischemia, infections). We used a rather broad definition of an interventional study, including any study where participants were prospectively assigned to at least one health‐related intervention, followed by an evaluation of its effect with respect to a health outcome.23 We deliberately did not restrict ourselves to randomized clinical trials, although this design is considered the gold standard of interventional studies, because a substantial number of SCI studies might be nonrandomized due to various practical reasons.15 Moreover, we focused on studies that included at least one group of humans with traumatic or nontraumatic SCI, thus excluding postmortem examinations, in‐vitro studies, case studies and papers dealing with healthy controls or animals only. Furthermore, we included only publications in the English language.
In addition to these criteria, which had been specified in advance, further restrictions were set up in the course of the screening process, adhering to the spirit of the pre‐defined criteria as closely as possible. First, we refined the distinction between clinical trials and case series, based on whether hypothesis tests or descriptive evaluations of the intervention effect were carried out. Then, we decided to exclude case series, because they are supposed to have less impact on decision‐making than clinical trials.
Moreover, with respect to the secondary objective of investigating the statistical methods, there would have been no use in considering studies where only descriptive or narrative analyses had been done. Second, a research paper was excluded if the authors solely assessed the reliability of an instrument in the interventional part of the study (eg, reliability of blood pressure responses to a sit‐up test24), because the focus in these studies was not on improving any participant‐relevant outcome. Third, regarding duplicate publications, we applied a quite narrow definition and excluded studies only if they reported results from exactly one and the same sample. If small sample sizes were present and the statistical methods differed between studies, the information concerning the latter was merged. With respect to our main outcomes, this way of dealing with duplicate publications allowed for reducing bias while not losing substantial information.
Screening and data extraction were conducted in the following way: based on a consensus decision of GZ, YH, AB, AT, and SL, the query “spinal cord injury OR tetraplegia OR paraplegia OR tetraparesis OR paraparesis OR spinal cord infarction OR spinal cord ischemia” was entered in the advanced search form in PubMed, with the time frame for the date of publication set to “Jan 1, 2014 to Dec 31, 2015.” Since PubMed automatically applies term mapping, additional MeSH term searches were not done. We deliberately did not apply further filters to the searches, in order to minimize the risk of bias due to algorithm‐based exclusion. In order to enhance transparency, we provide the text file, which contains the abstracts of the search results, upon request. Due to the considerable number of studies, we think that this makes more sense than providing lengthy tables containing all individual study details.
As a first step, all abstracts were screened for eligibility by RS under the supervision of GZ, YH, and AT.
This part included a pilot phase covering almost 100 abstracts, with the purpose of getting used to the screening process and refining inclusion and exclusion criteria. In order to avoid bias resulting from the fact that only one single person did the abstract screening, studies were only excluded if they clearly failed to meet at least one of the inclusion criteria mentioned above. In case there were doubts at this stage, the corresponding item was left in the list of included studies, postponing the decision to the next screening stage. As a second step, the full texts of all remaining studies—if available—were screened independently by GZ and LB. Studies were either excluded, following the aforementioned criteria, or classified as eligible for subsequent extraction of basic publication characteristics, sample sizes and statistical methods. If consensus could not be reached, the final decision was left to AB (statistician), YH (neuroscientist) or SL (clinician). Although some items were not applicable due to the specific methodological focus of our review, we tried to adhere to recommendations for conducting and reporting systematic reviews22, 25 as closely as possible (the PRISMA checklist is provided as a supplemental file).
Outcomes and statistical methods
The primary outcomes were the total sample size and the number of individuals with SCI. Both quantities refer to the sizes of the respective groups that were finally included in the statistical analyses rather than to the numbers of participants who had been initially considered as eligible.
As to the secondary outcomes, within the subgroup of “small‐sample size studies,” additional information regarding the statistical test(s) used and the respective group sizes was recorded. A study had to meet the following two criteria in order to qualify as a “small‐sample size study.” First, it was required that several groups (eg, different treatment arms in a clinical trial or repeated measurements taken from one single group before and after an intervention) were compared to each other with respect to a primary or secondary outcome by means of a statistical hypothesis test. Second, at least one of those groups had to comprise fewer than 20 individuals. Although this cutoff is somewhat arbitrary, the underlying rationale is that some frequently used hypothesis tests are based on certain assumptions, which cannot be meaningfully assessed in such scenarios. For example, based on 20 observations, it is impossible to assess whether the data was generated from a normal distribution or not (Supplementary Figure S1). Apart from that, the power for sample sizes below that cutoff might be considerably decreased even for comparisons of paired groups and moderate effects (please refer to the R code in the supplementary material). Within the small‐sample size studies, we also recorded if potentially related statistical problems were mentioned. In this case, a further distinction between “just mentioned,” “rule‐of‐thumb‐like implication for the choice of statistical methods,” and “statistically sound reasoning for the choice of statistical methods” was made. The detailed definitions of these categories are provided in Table S1 in the supplementary material.
All variables collected in the present review had been specified by GZ, YH, and AB in advance. The primary and secondary outcomes defined above were analyzed using basic descriptive summary statistics and graphical displays of the collected data.
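The rationale for the cutoff of 20, namely that normality cannot be judged from so few observations, can be illustrated with a small simulation in the spirit of Supplementary Figure S1. The following is only a sketch and not the authors' supplementary R code: the bin edges, the seed, and all function names are assumptions made for illustration. It draws two samples of size n = 20 from the same standard normal distribution and prints a crude text histogram of each.

```python
import random

# Bin edges in standard-deviation units (an arbitrary choice); the outer
# bins are open-ended so that every observation lands in exactly one bin.
EDGES = [float("-inf"), -2.0, -1.0, 0.0, 1.0, 2.0, float("inf")]
LABELS = ["< -2", "-2..-1", "-1..0", "0..1", "1..2", ">= 2"]


def bin_counts(sample, edges=EDGES):
    """Count how many observations fall into each half-open bin [lo, hi)."""
    counts = [0] * (len(edges) - 1)
    for x in sample:
        for k in range(len(counts)):
            if edges[k] <= x < edges[k + 1]:
                counts[k] += 1
                break
    return counts


def draw_normal_sample(n, rng):
    """Draw n independent observations from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


rng = random.Random(1)  # arbitrary seed, used only for reproducibility
for label in ("sample A", "sample B"):
    counts = bin_counts(draw_normal_sample(20, rng))
    print(label)
    for name, count in zip(LABELS, counts):
        print(f"  {name:>7} | {'#' * count}")
```

Repeating the draw with different seeds typically yields histograms that look skewed, flat, or bimodal by chance alone, even though the underlying distribution is normal in every case.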
RESULTS
The search on PubMed identified 8897 publications. Applying the exclusion criteria led to a substantial reduction down to 207 studies, which were eventually considered eligible for qualitative and quantitative syntheses. Details about the selection process can be seen in the flow chart (Figure 1).
Figure 1
Flow chart of the screening process, adapted from Ref. (25)
Sample sizes
The median number of individuals with SCI was 15 (range: 4‐558, interquartile range: 10‐32), the median total sample size was 18 (range: 4‐582, interquartile range: 10‐34), see Figure 2. The subset of studies with a total sample size >100 (n = 12) comprised multicenter26, 27, 28 as well as single‐center trials.29, 30, 31, 32, 33, 34, 35 We found that 167 out of 207 (81%) eligible studies met our definition of a small‐sample size study. Within this subset, the median number of participants with SCI and total sample size were 13 (range: 4‐160, interquartile range: 9‐19) and 15 (range: 4‐160, interquartile range: 10‐22), respectively, see Figure 3.
Figure 2
Distribution of the number of participants with spinal cord injury (SCI) and the total sample size in clinical trials on individuals with SCI
Figure 3
Distribution of the number of participants with spinal cord injury (SCI) and the total sample size within the subgroup of small‐sample size clinical trials on people with SCI
Statistical methods and mentioning potential problems
Three of 167 small‐sample size studies did not provide sufficient information about the statistical methods and were therefore excluded from subsequent analyses. In the remaining 164 studies, repeated measures analyses were conducted most frequently (parametric/nonparametric comparisons of paired groups: n = 143 studies vs unpaired tests: n = 52 studies). Parametric methods were more often used than nonparametric methods (123 vs 72 studies). The sizes of the groups that were compared showed large variation, due to a broad variety of design specifications (eg, different numbers of repeated measurements). Group sizes were equal in some studies, especially when paired comparisons were made, because balance is being imposed by design in this case (eg, 2 groups of size 14 each36). However, in other studies, highly unbalanced scenarios were found (eg, 2 groups of sizes 6 and 10, respectively37). Moreover, sometimes, comparisons of very small groups were reported (eg, pre‐post comparison for a group of three individuals38; two unpaired groups with three and five individuals, respectively39).
The authors did not mention the small sample size issue at all in 58 out of 167 studies (35%). Awareness of potential problems related to small sample size, yet without referring to the choice of a particular statistical method, was present in 92 studies (55%).
Different implications thereof were mentioned, ranging from considerable baseline variability, which could not be accounted for,40, 41 to limited generalizability of the results,42, 43 decreased power,44, 45 and the need for replication in larger samples.46, 47 Seventeen studies (10%) stated a rule‐of‐thumb like justification for the choice of a particular statistical approach, mainly that nonparametric methods were used instead of parametric approaches due to the small sample size.48, 49, 50, 51, 52, 53, 54, 55 However, no study was found where the authors provided a theoretically sound reason (eg, by referring to a simulation study, which indicates a good performance in small sample settings).
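As a quick arithmetic cross‐check, the counts reported in this section can be tallied against the percentages given here and in the Abstract. The snippet below is a minimal sketch; the dictionary keys are shorthand labels introduced for illustration, and the count of zero for the last category follows from the statement that no study provided a theoretically sound reason.

```python
eligible = 207      # studies included in the review
small_sample = 167  # studies meeting the small-sample size definition

# Awareness categories among the small-sample size studies
awareness = {
    "not mentioned": 58,
    "just mentioned": 92,
    "rule of thumb": 17,
    "sound reasoning": 0,
}

# The four categories should cover all small-sample size studies.
assert sum(awareness.values()) == small_sample

print(f"small-sample share: {100 * small_sample / eligible:.1f}%")
for label, count in awareness.items():
    print(f"{label:>15}: {count:3d} ({100 * count / small_sample:.1f}%)")

# Studies that mentioned the issue in any form (figure quoted in the Abstract)
mentioned = awareness["just mentioned"] + awareness["rule of thumb"]
print(f"mentioned at all: {mentioned} ({100 * mentioned / small_sample:.1f}%)")
```

The rounded values agree with the reported 81%, 35%, 55%, 10%, and 65%.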
DISCUSSION
The results indicate that the total sample sizes in SCI trials are often quite small. Only few multicenter clinical trials could be identified, where the number of participants was large. Even in studies with moderate to large sample sizes, comparisons of subgroups were frequently conducted, leading to a proportion of small sample size studies of about 80 percent. Furthermore, we found that within small‐sample size studies, parametric methods were used more often than their nonparametric counterparts, and repeated‐measures analyses were conducted most frequently.
Although these results might have some implications for the post‐study probability of the research findings being true, as outlined in the Introduction, they are by no means supposed to serve as an assessment of the quality of SCI research in general. It has been pointed out that systematic reviews are highly relevant for evidence‐based decision making. If conducted with methodological rigor and based on low‐bias studies, they provide a high level of evidence.22 However, reviews were excluded in the present study, which of course makes sense with respect to the investigation of sample sizes, but nevertheless prevents us from drawing definite conclusions regarding whether or not to trust the results from SCI research in general.
Regarding the statistical methods used, we could come up with some remarkable findings.
First, researchers used parametric methods more often than their nonparametric counterparts. However, parametric tests (eg, t‐tests) might yield erroneous results in terms of type I error probabilities (ie, the probability of getting a false positive), but also with respect to power, if the underlying assumptions are violated. We can only speculate about the extent to which this problem is present in SCI research, because a thorough examination of the question whether or not the assumptions underlying the respective tests were met would have exceeded the scope of the present review.
However, it should be noted that in small datasets, it is virtually impossible to assess the underlying assumptions in a meaningful way. Apart from that, in cases where ordinal outcomes (eg, ASIA scores) are compared between groups by using a parametric test, the underlying assumptions are clearly violated, regardless of whether the sample is small or large. Even though the classical nonparametric alternatives such as the Wilcoxon‐Mann‐Whitney test are applicable in more general settings, they might show suboptimal performance in small sample sizes, too.56 Likewise, we observed that chi‐squared tests were used quite frequently for the analysis of categorical data, although it is well‐known that they are based on an approximation and, hence, should only be applied to samples of reasonable size.57
Second, comparisons of paired observations (ie, repeated measures) were more frequently conducted than between‐subject comparisons. In general, for fixed group sizes, a repeated measures design yields larger power than the comparison of unpaired groups, because the variance resulting from paired observations is smaller. Hence, even for small samples, the power might still be reasonably large. Nevertheless, consider, for example, a simple pre‐post‐design, where measurements are taken before and after a particular intervention of interest. Even if the effect size is moderate (ie, Cohen's d, which is the mean difference divided by the standard deviation, is equal to 0.5), the power is only about 56% for a sample size of 20 (details are provided in the R code in the supplementary material). With smaller sample sizes, the power decreases to 44% (n = 15) and 29% (n = 10). Nevertheless, the values extracted from the few publications where Cohen's d was provided indicate that, for significant results, the corresponding median Cohen's d was equal to 1 (for details regarding effect sizes and variances, see Supplementary Material S1).
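The power figures quoted above can be reproduced approximately without specialised software. The sketch below is not the authors' supplementary R code; it assumes a two‐sided paired t‐test at α = 0.05, with Cohen's d taken as the mean difference divided by the standard deviation of the paired differences, so that the test statistic follows a noncentral t distribution with n − 1 degrees of freedom and noncentrality d·√n. The distribution function is evaluated by simple midpoint integration using only the Python standard library; all function names and the integration settings are illustrative choices.

```python
import math


def _norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def nct_cdf(c, df, ncp, steps=4000, upper=10.0):
    """P(T <= c) for T = (Z + ncp) / S, with S = sqrt(chi2_df / df).

    Conditional on S = s, P(T <= c) = Phi(c * s - ncp); this is averaged
    over the density of S by the midpoint rule on (0, upper).
    """
    h = upper / steps
    log_const = math.log(2.0) + (df / 2.0) * math.log(df / 2.0) - math.lgamma(df / 2.0)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        log_density = log_const + (df - 1.0) * math.log(s) - df * s * s / 2.0
        total += _norm_cdf(c * s - ncp) * math.exp(log_density)
    return total * h


def t_quantile(p, df):
    """Quantile of the central t distribution for p > 0.5, by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if nct_cdf(mid, df, 0.0) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def paired_t_power(d, n, alpha=0.05):
    """Power of the two-sided paired t-test for standardized effect d."""
    df = n - 1
    ncp = d * math.sqrt(n)
    t_crit = t_quantile(1.0 - alpha / 2.0, df)
    return 1.0 - nct_cdf(t_crit, df, ncp) + nct_cdf(-t_crit, df, ncp)


for n in (20, 15, 10):
    print(f"d = 0.5, n = {n:2d}: power = {paired_t_power(0.5, n):.3f}")
```

Under these assumptions the computed values should agree with the percentages stated in the text up to rounding; calling `paired_t_power` with other effect sizes and group sizes allows similar checks for the remaining figures in this section.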
Hence, as long as the group sizes do not drop below n = 10 (power for n = 9: 74.8%), the power might be acceptable for paired tests. However, if we took the first quartile instead of the median, that is, if we set d = 0.8, the power would drop to 56.0% for a group size n = 9. Hence, reporting of means and standard deviations of the group differences definitely needs to be improved, in order to allow for a thorough analysis of effect sizes, power, and sample size calculation methods.
Third, in about two‐thirds of the small‐sample size studies, the authors were aware of the fact that the small sample size represented a limitation. Whereas in some cases, only one single, short sentence regarding this issue had been put down in the publication, other authors also discussed potential implications to some extent. Nevertheless, the impact of small sample sizes on power and post‐study probabilities was not mentioned. Likewise, references to studies examining the appropriateness of certain statistical methods in particular small‐sample size settings were not made.
We found clear signs that a considerable number of researchers are aware of the problems regarding small sample sizes, at least to some extent. So, obviously, there is a chance of improving the situation, but how? In our opinion, this goal can be achieved only in a collaborative effort of all scientists and stakeholders involved in SCI research. Clinical researchers could try to establish new connections with other research groups, in order to set the stage for conducting adequately powered multicenter trials.16 Although such a trial might be demanding with respect to resources and does not come without methodological challenges (eg, standardization issues), there is no other way to obtain reliable large‐scale evidence. Clinical trial networks could serve as a convenient framework for such collaborative efforts.
Apart from that, we propose that the editors of SCI journals consider the advice of expert statistics reviewers, and publish methodological papers on a more or less regular basis. A short tutorial‐like paper, aimed at explaining and illustrating a certain methodological or statistical topic in a nontechnical style, could substantially facilitate understanding and increase awareness of potential problems related to small‐sample settings. However, not only clinical researchers and journal editors could contribute to improving the status quo. Also, statisticians might have to change their attitudes towards publishing more in topic‐oriented medical journals. Although a paper focused on theory is, of course, a great achievement, the merit of providing a nontechnical explanation of a certain methodological topic is equally outstanding, because this could lead to an increased reliability of research findings concerning treatment options for severely impaired people. Moreover, sessions that are focused on statistical methods should become an integral part of clinical research conferences, in order to further stimulate the interdisciplinary exchange of new methodological approaches. Apart from discussing advances regarding the statistical techniques, those interactions should also be focused on a shift towards more emphasis on clinical rather than statistical significance. Especially in people with SCI, the treatment groups, which are compared with respect to a certain outcome, might be very heterogeneous with respect to several patient characteristics (eg, incomplete vs complete, considerable variation in the years since injury). 
Therefore, identifying responders to interventions (ie, individuals that make clinically relevant improvements) on a case‐by‐case basis might sometimes be more sensible than comparing average responses between very small subgroups by means of a statistical hypothesis test.
The quality of the included studies was not taken into account in the analyses, because quality appraisal is more relevant for systematic reviews focused on health‐related outcomes. Although there might be some associations between the choice of statistical methods and other aspects of a study's quality, a closer look at this issue lies beyond the scope of the present publication. Likewise, a risk of bias assessment, which is usually recommended for systematic reviews,25 was not carried out, because within‐study bias (eg, bias due to lack of randomization and unblinded outcome assessment) is irrelevant with respect to sample sizes and statistical methods. By contrast, bias across studies could indeed play a role. Although the relationship might not be perfect, it is generally assumed that small studies are more prone to publication bias.58 Consequently, our results might still overestimate the “typical” sample size of an SCI trial. Likewise, publication bias and selective outcome reporting are also likely to have a considerable impact on the assessment of the statistical methods that have been used. In this case, however, we can only speculate about the direction of bias.
We have tried to minimize the risk of selection bias by carefully considering several methodological issues in advance. First, when doing the database search, we kept the application of filters to a minimum, in order to avoid bias due to algorithm‐based pre‐selection of studies. Second, we employed a quite conservative methodology for the process of study selection, since we excluded studies at the abstract screening stage only if we could be really sure that they did not meet our inclusion criteria.
Furthermore, each stage was supervised by clinically and statistically experienced researchers, who shared their opinion with the reviewers when unclear cases were encountered. Moreover, even the unambiguous results of the abstract screening underwent a rough overall check for errors by two reviewers. However, potential bias introduced at that stage cannot be entirely excluded, because it was decided, due to the limited available resources, that the abstract screening should be done by a single person (RS). Third, clinical trial registries were not searched. However, with respect to our research question, the information provided in registries would not have helped much anyway, since the final sample sizes cannot be obtained from a study protocol. Moreover, with regard to the statistical methods, there could be a considerable difference between what had been prespecified and what was actually carried out.
The present systematic review has some limitations. For feasibility reasons, we have decided to perform the literature search in one single database only. However, presumably, additional searches in databases like EMBASE would mainly increase the number of studies that were published as conference abstracts, etc, and, therefore, lead to even smaller median sample sizes. Furthermore, we have not conducted any stratified analyses of the sample sizes (eg, stratified according to different types of interventions), because we wanted to provide a broad overall picture. Finally, the present systematic review is not supposed to serve as an overall assessment of the validity of findings from SCI studies, nor should it be regarded as an evaluation of the statistical quality in general.
Rather, our work is focused on two particular goals, namely on determining the median sample size in SCI interventional studies and examining some key characteristics of the statistical methods.
To conclude, for the first time, we provide systematically collected evidence, which indicates that sample sizes in SCI interventional studies are small. Moreover, some awareness of potential problems related to small sample sizes seems to be present, yet without giving a more detailed account of implications for the choice of statistical methods in most cases. Further systematic assessments of selected methodological topics (eg, sample size calculation and power) are warranted.
Supporting Information
Table S1 List of the pre‐specified variables and the corresponding detailed descriptions
Table S2 Summary statistics of means and variances for paired settings (N = 70). The values were summarized by median (first quartile‐third quartile)
Table S3 Summary statistics of means and variances for unpaired settings (N = 18). The values were summarized by median (first quartile‐third quartile)
Figure S1 Both histograms are based on samples of size n = 20 from a normal distribution
Supplementary Material S1: Description of the methodology that was used for the extraction of effect sizes, and corresponding results
Authors: Georg Zimmermann; Lisa-Maria Bolter; Ronny Sluka; Yvonne Höller; Arne C Bathke; Aljoscha Thomschewski; Stefan Leis; Simona Lattanzi; Francesco Brigo; Eugen Trinka Journal: J Evid Based Med Date: 2019-06-23