
Applying univariate vs. multivariate statistics to investigate therapeutic efficacy in (pre)clinical trials: A Monte Carlo simulation study on the example of a controlled preclinical neurotrauma trial.

Hristo Todorov¹,², Emily Searle-White³, Susanne Gerber¹.

Abstract

BACKGROUND: Small sample sizes combined with multiple correlated endpoints pose a major challenge in the statistical analysis of preclinical neurotrauma studies. The standard approach of applying univariate tests on individual response variables has the advantage of simplicity of interpretation, but it fails to account for the covariance/correlation in the data. In contrast, multivariate statistical techniques might more adequately capture the multi-dimensional pathophysiological pattern of neurotrauma and therefore provide increased sensitivity to detect treatment effects.
RESULTS: We systematically evaluated the performance of univariate ANOVA, Welch's ANOVA and linear mixed effects models versus the multivariate techniques, ANOVA on principal component scores and MANOVA tests by manipulating factors such as sample and effect size, normality and homogeneity of variance in computer simulations. Linear mixed effects models demonstrated the highest power when variance between groups was equal or variance ratio was 1:2. In contrast, Welch's ANOVA outperformed the remaining methods with extreme variance heterogeneity. However, power only reached acceptable levels of 80% in the case of large simulated effect sizes and at least 20 measurements per group or moderate effects with at least 40 replicates per group. In addition, we evaluated the capacity of the ordination techniques, principal component analysis (PCA), redundancy analysis (RDA), linear discriminant analysis (LDA), and partial least squares discriminant analysis (PLS-DA) to capture patterns of treatment effects without formal hypothesis testing. While LDA suffered from a high false positive rate due to multicollinearity, PCA, RDA, and PLS-DA were robust and PLS-DA outperformed PCA and RDA in capturing a true treatment effect pattern.
CONCLUSIONS: Multivariate tests do not provide an appreciable increase in power compared to univariate techniques to detect group differences in preclinical studies. However, PLS-DA seems to be a useful ordination technique to explore treatment effect patterns without formal hypothesis testing.

Year: 2020 PMID: 32214370 PMCID: PMC7098614 DOI: 10.1371/journal.pone.0230798

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The aim of controlled preclinical studies is usually to investigate the therapeutic potential of a chemical or biological agent, or a certain type of intervention. For this purpose, animals are randomized to a control group and a number of treatment groups in a manner similar to clinical trials. For quantitative endpoints, treatment effects are evaluated by assessing mean differences between control and intervention groups. In an effort to obtain as much information as possible with minimal cost of life, usually multiple endpoints are included in the trial [1], which is further motivated by the fact that the optimal efficacy endpoint for a specific disease might not be known. In this context, the null hypothesis of no treatment effect (H0) can be rejected in two ways. The standard approach consists of performing independent univariate tests on each variable separately. However, this strategy might lead to an inflated family-wise error rate. In addition, different endpoints are usually correlated, implying that preclinical trials are multi-dimensional in nature. Consequently, the second approach is to use a multivariate technique, which accounts for the covariance/correlation structure of the data. H0 is usually tested on some kind of linear combination of the original variables. Due to the increased complexity of analysis and interpretation of results in this case, such an approach has found limited use in preclinical research so far. A number of studies have highlighted the potential benefits of multivariate techniques in the context of preclinical trials [2] and more specifically animal neurotrauma models [3-7]. Traumatic or ischemic events to the central nervous system such as stroke, spinal cord or traumatic brain injury are followed by a multi-faceted pathophysiology which manifests on molecular, histological and functional levels [8-11]. 
Individual biological mechanisms that are disrupted by or result from the neurotrauma such as apoptosis [12, 13], neuroinflammation [14-18], oxidative stress [18-20] and plasticity alterations [21, 22] have provided therapeutic targets in animal models. However, translation of candidate therapies to humans continues to be mostly unsuccessful [23-26]. Many studies indicate that individual biological processes interact together in determining functional outcome, which is why multivariate measures might capture the complex disease pattern more successfully and therefore detect therapeutic intervention efficacy with increased sensitivity [3, 4]. However, no solid proof of the superiority of multivariate methods beyond these theoretical considerations has been ascertained so far. The aim of our current study was to obtain empirical evidence as to whether univariate or multivariate statistical techniques are better suited for detecting treatment effects in preclinical neurotrauma studies. For this purpose, we performed simulations under a broad range of conditions while simultaneously trying to mimic realistic experimental conditions as closely as possible. We investigated the empirical type I error rate as well as empirical power of several competing techniques and evaluated factors which impact their performance.

Methods

Simulation procedure

We performed a Monte Carlo study using the statistical software R [27], following the recommendations of Burton et al. for the design of simulation studies [28]. Artificial data were based on a real study in a rat model of traumatic brain injury. In the preclinical trial, twenty animals per group received either vehicle control or a therapeutic agent. Functional outcome was evaluated based on 6 different endpoints including 20-point neuro-score, limb placing score, lesion and edema volume, and T2 lesion in the ipsilateral and contralateral hemisphere. All variables were measured repeatedly at three time points, resulting in a data matrix with 18 columns. In order to obtain more general estimates of the mean vector and covariance matrix for subsequent simulations, a non-parametric bootstrap procedure was applied using the data from the saline control group from the in vivo study. Since two animals from this group were excluded from the study, the resampling procedure was conducted with the available 18 animals. 10,000 samples were drawn from the original data with replacement, and the mean vector and covariance matrix were averaged over all bootstrap samples. In order to retain the covariance structure of the data, complete rows of the data matrix (corresponding to all measurements from a single animal) were always sampled together as an 18-dimensional vector. The nearPD R function was then employed to force the calculated dispersion matrix to be positive definite. The resulting mean vector and covariance matrix were used as parameters for multivariate distributions, from which data for subsequent simulations were sampled (see Tables 1 and 2 of S1 Appendix). We generated one control group and three treatment groups under each scenario, which corresponds to a typical preclinical trial design where increasing doses of a therapeutic agent are tested against a control treatment.
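The row-wise bootstrap described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study: the toy data matrix, the function names and the reduced number of resamples are assumptions made for the example, and the positive-definiteness correction (nearPD in the study) is left out.

```python
import random

def mean_vector(rows):
    """Column means of a data matrix stored as a list of rows."""
    n, p = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(p)]

def covariance_matrix(rows):
    """Unbiased sample covariance matrix (denominator n - 1)."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def bootstrap_mean_cov(rows, n_boot=10000, seed=0):
    """Average the mean vector and covariance matrix over bootstrap samples.

    Whole rows (one animal with all its measurements) are resampled with
    replacement, so the covariance structure between variables is retained.
    """
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    mean_acc = [0.0] * p
    cov_acc = [[0.0] * p for _ in range(p)]
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        m = mean_vector(sample)
        c = covariance_matrix(sample)
        for j in range(p):
            mean_acc[j] += m[j] / n_boot
            for i in range(p):
                cov_acc[i][j] += c[i][j] / n_boot
    return mean_acc, cov_acc

# Hypothetical toy data: 6 animals (rows) x 3 variables (columns).
toy_rows = [
    [10.0, 4.0, 1.2],
    [12.0, 5.0, 1.5],
    [9.0, 3.5, 1.1],
    [11.0, 4.8, 1.6],
    [13.0, 5.5, 1.9],
    [10.5, 4.2, 1.3],
]
boot_mean, boot_cov = bootstrap_mean_cov(toy_rows, n_boot=2000, seed=42)
print(boot_mean)
```

In the study the same idea is applied to 18 animals and 18 variables with 10,000 resamples.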

Simulation factors

Sample size

We performed simulations with 5, 10, 15 and 20 measurements per treatment group to investigate the impact of sample size. These values were selected to represent realistic group sizes commonly encountered in preclinical trials. Additionally, we performed simulations with 30, 40 and 50 replicates per group to investigate the effect of a larger sample size beyond those typical for animal studies. In the course of this study we use the terms measurements, subjects and replicates per group interchangeably.

Effect size

Treatment effects were based on Cohen’s d with values set to 0, 0.2, 0.5 and 0.8 corresponding to no effects, small, moderate and large statistical effect sizes relative to the control group, respectively [29]. We chose Cohen’s d because this standardized statistical measure of effect size is independent of the scale of the original variables. The population mean values for the treatment groups were then calculated using the formula μ1 = μ0±s*d, where μ0 corresponds to the population mean of the respective variable in the control group and s signifies the standard deviation of both groups in case of equal variance or the average standard deviation in case of unequal variance. We performed simulations with no treatment effects in all groups to investigate empirical type I error rate. Additionally, we investigated empirical power by simulating either large, moderate or small effects in the treatment groups relative to the control group.
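As a worked illustration of the formula μ1 = μ0±s*d, the short Python sketch below computes a treatment-group mean from a control mean and a target Cohen's d. The function name and the numeric values are hypothetical, and taking the "average standard deviation" as the arithmetic mean of the two group standard deviations is an assumption; the text does not specify how the average was formed.

```python
import math

def treatment_mean(mu0, sd_control, d, sd_treatment=None, direction=1):
    """Population mean of a treatment group for a target Cohen's d.

    With equal variances, s is the common standard deviation. With unequal
    variances, s is taken here as the arithmetic mean of the two standard
    deviations (an assumption of this sketch).
    """
    if sd_treatment is None:
        s = sd_control
    else:
        s = (sd_control + sd_treatment) / 2.0
    return mu0 + direction * s * d

# Equal variance: control mean 10, SD 2, large effect (d = 0.8).
print(treatment_mean(10.0, 2.0, 0.8))                # approximately 11.6
# Same effect in the opposite direction (e.g. a reduction in lesion volume).
print(treatment_mean(10.0, 2.0, 0.8, direction=-1))  # approximately 8.4
# Variance ratio 1:5, so the treatment SD is sqrt(5) times the control SD.
print(treatment_mean(10.0, 2.0, 0.5, sd_treatment=2.0 * math.sqrt(5)))
```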

Distribution of dependent variables

The dependent variables were simulated to follow a multivariate normal distribution to comply with the assumptions of the investigated methods. Additionally, we employed the multivariate lognormal distribution and the multivariate gamma distribution in order to investigate the impact of departures from normality. The multivariate gamma distribution was modelled using its shape parameter α and its rate parameter β. These parameters were derived from the target mean and variance values using the following relationships: μ = α/β and σ² = α/β², where μ and σ² correspond to the mean and variance of the gamma distribution, respectively. Since we wanted to simulate specific values for the mean and variance, we used the following equations to obtain the shape and rate parameter of the gamma distribution: α = μ²/σ² and β = μ/σ². The correlation matrix used for the simulation of multivariate data sets is shown in Table 2 of S1 Appendix.
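The conversion from a target mean and variance to the gamma shape and rate parameters can be checked with a few lines of Python. The sketch covers a single (univariate) gamma variable only; how the correlated multivariate gamma data were generated is not reproduced here. The function name and target values are illustrative. Note that Python's random.gammavariate takes a scale parameter, which is the reciprocal of the rate.

```python
import random

def gamma_shape_rate(mean, variance):
    """Shape and rate of a gamma distribution with the given mean and variance."""
    shape = mean ** 2 / variance
    rate = mean / variance
    return shape, rate

# Illustrative target values: mean 4, variance 2.
shape, rate = gamma_shape_rate(4.0, 2.0)
print(shape, rate)  # 8.0 2.0

# Round trip: mean = shape / rate, variance = shape / rate**2.
print(shape / rate, shape / rate ** 2)  # 4.0 2.0

# Empirical check; random.gammavariate expects (shape, scale) with scale = 1 / rate.
rng = random.Random(7)
draws = [rng.gammavariate(shape, 1.0 / rate) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)
```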

Variance

Parametric univariate methods to detect mean differences assume that variance in all groups is equal, which in the multivariate case extends to the assumption of homogeneity of covariance matrices [30]. Therefore we first performed simulations with all groups having equal variance. Then we simulated treatment groups having variance twice or 5 times higher than the variance in the control group. This allowed us to investigate the impact of increasing variance heterogeneity. Factors were crossed to produce 252 different simulation scenarios with 1000 replicate data sets generated under each combination of simulation conditions.
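The count of 252 scenarios is consistent with fully crossing the four simulation factors described above: 7 sample sizes, 4 effect sizes, 3 distributions and 3 variance ratios. The Python sketch below enumerates that grid; the variable names and the tuple layout are illustrative and not taken from the study's code.

```python
from itertools import product

sample_sizes = [5, 10, 15, 20, 30, 40, 50]       # measurements per group
effect_sizes = [0.0, 0.2, 0.5, 0.8]              # Cohen's d
distributions = ["normal", "lognormal", "gamma"]  # multivariate distribution
variance_ratios = [1, 2, 5]                       # treatment variance relative to control

scenarios = list(product(sample_sizes, effect_sizes, distributions, variance_ratios))
n_scenarios = len(scenarios)
n_data_sets = n_scenarios * 1000  # 1000 replicate data sets per scenario

print(n_scenarios)  # 252
print(n_data_sets)  # 252000
```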

Methods to detect treatment effects

Univariate statistics

The univariate approach of investigating treatment differences between groups consisted of a series of independent analysis of variance (ANOVA) tests on each outcome variable separately. Furthermore, we applied Welch’s ANOVA as implemented in the oneway.test R function, which does not assume equal variance between groups [31]. In order to take the repeated measures nature of the input data into account, we also performed linear mixed effects tests for each endpoint. Since we did not simulate an interaction between treatment effect and time, we only included the main effects in the mixed effects model without an interaction term. We rejected H0 of no treatment effect if the main effect for the treatment factor was significant.
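For readers unfamiliar with Welch's ANOVA, the following Python sketch computes its test statistic and degrees of freedom from raw group data. It is an independent illustration of the standard Welch formula, not the oneway.test function used in the study. The p-value is omitted because it requires the F distribution, which the Python standard library does not provide. The group values are made up.

```python
from statistics import mean, variance

def welch_anova(groups):
    """Welch's F statistic with numerator and denominator degrees of freedom."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Each group is weighted by n / s^2, so noisy groups count less.
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_sum = sum(w)
    weighted_grand_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_sum
    lam = sum((1 - wi / w_sum) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    numerator = sum(wi * (mi - weighted_grand_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2

# Hypothetical data: one control group and three dose groups, 5 animals each.
control = [10.1, 9.4, 11.2, 10.6, 9.9]
dose1 = [10.8, 12.9, 9.1, 11.7, 13.0]
dose2 = [12.4, 9.8, 14.1, 11.0, 13.6]
dose3 = [13.9, 10.2, 15.3, 12.8, 11.1]
f_stat, df1, df2 = welch_anova([control, dose1, dose2, dose3])
print(f_stat, df1, df2)
```

With only two groups the statistic reduces to the square of Welch's t statistic and the denominator degrees of freedom reduce to the Welch-Satterthwaite approximation, which offers a convenient sanity check.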

Multivariate statistics

The first multivariate strategy we investigated was performing ANOVA tests on principal component (PC) scores obtained from the original variables. We used eigen decomposition of the population correlation matrix in order to calculate the PCs, which is the preferred approach when variables are measured on different scales [30, 32]. Based on the Kaiser criterion, we only retained components whose corresponding eigenvalue was greater than one [33]. Component scores were obtained by multiplying the standardized data matrix of original variables with the eigenvectors of the population correlation matrix [32]. The second multivariate technique consisted of a series of multivariate analysis of variance (MANOVA) tests on each study variable with repeated measures. Each repeated measure was considered a separate dependent variable for the respective MANOVA. Thus, we performed 6 MANOVA tests, each of which included the three repeated measures of one endpoint as the dependent variables. The significance of the MANOVA tests was evaluated using four different statistics which are commonly provided by statistical software such as R, SAS or SPSS: Wilks’ lambda [34], Lawley-Hotelling trace [35], Pillai’s trace [36] and Roy’s largest root [37]. In all cases, H0 was rejected when the p-value from the omnibus test was less than 0.05; no specific contrasts or post hoc analyses were considered. Different techniques were evaluated based on the empirical type I error rate or on empirical power. Empirical type I error rate was defined as the number of significant statistical tests divided by the total number of tests when no treatment effects were simulated. Empirical power was defined as the number of significant tests divided by the total number of tests in the cases when treatment effects were simulated.
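The definitions of empirical type I error rate and empirical power amount to the same calculation applied to different simulation settings. The Python sketch below illustrates this with a deliberately simple stand-in: a two-group comparison using a z-test with known variance, not any of the methods compared in this study. All names, the seed and the settings (10 subjects per group, 2000 replicates) are assumptions chosen for the example.

```python
import random
from statistics import NormalDist, mean

def rejection_rate(p_values, alpha=0.05):
    """Share of tests with p < alpha.

    This is the empirical type I error rate when no effect was simulated
    and the empirical power when an effect was simulated.
    """
    return sum(p < alpha for p in p_values) / len(p_values)

def z_test_p_value(control, treated, sd=1.0):
    """Two-sided z-test for equal-sized groups with known SD (stand-in test)."""
    n = len(control)
    z = (mean(treated) - mean(control)) / (sd * (2.0 / n) ** 0.5)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def simulate_p_values(d, n_per_group, n_sim, rng):
    """Simulate n_sim two-group data sets with standardized effect size d."""
    p_values = []
    for _ in range(n_sim):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        p_values.append(z_test_p_value(control, treated))
    return p_values

rng = random.Random(1)
type_one_error = rejection_rate(simulate_p_values(0.0, 10, 2000, rng))
power_large = rejection_rate(simulate_p_values(0.8, 10, 2000, rng))
print(type_one_error, power_large)
```

Under these toy settings the first value should lie near the nominal 5% and the second well below 80%.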

Multivariate dimensionality reduction techniques for pattern analysis

In addition to formally comparing the type I error rate and power of univariate and multivariate statistics, we also investigated if ordination techniques might be useful to detect patterns of treatment effects in multi-dimensional preclinical data sets. We focused on methods that perform ordination and dimensionality reduction based on Euclidean distances and are therefore suitable for quantitative and semi-quantitative data. First, we applied PCA, linear discriminant analysis (LDA), redundancy analysis (RDA), and partial least squares discriminant analysis (PLS-DA) on 1000 simulated data sets with 5 measurements per group and no treatment effects. We plotted the first versus the second multivariate dimension and visually inspected the plots. If the 95% confidence ellipse around the control group did not overlap with the confidence ellipses around the data points for the treatment groups, we considered that the ordination method falsely captured a treatment effect pattern in the data. Next, we examined the sensitivity of the ordination methods to detect true treatment effect patterns by simulating 1000 data sets with 5 measurements per group and huge treatment effects (Cohen’s d = 2.0). We used this effect size as we did not observe a difference between groups when smaller effect sizes were simulated. We considered that the respective method correctly accounted for a treatment effect pattern in the data if the 95% confidence ellipse around the control group did not overlap with the confidence ellipses around the simulated treatment groups. Finally, we provide an applied example of combining dimensionality reduction techniques with formal hypothesis testing using one simulated data set with 5 measurements per group and treatment effects on only half of all the variables.

Results

Competing multivariate statistics

Prior to investigating the performance of univariate and multivariate techniques, we examined the four MANOVA test statistics in order to identify the most appropriate for subsequent comparisons. Fig 1 shows representative results for the type I error and power of the MANOVA test using the four different statistical criteria (see Figs 1–4 of S1 Appendix for complete results). We observed the same trend under all simulation scenarios, with Roy’s largest root having a considerably elevated false positive rate of over 30%. In contrast, the remaining statistics exhibited very similar type I error rates. Pillai’s trace was the most robust measure, followed by Wilks’ lambda and Lawley-Hotelling trace. Roy’s largest root was not considered with regard to power analysis due to the unacceptably high type I error rate. Pillai’s trace consistently demonstrated the lowest power. Wilks’ lambda was associated with a slightly higher probability of correctly rejecting the null hypothesis in the presence of treatment effects than Pillai’s trace, but it was outperformed by Lawley-Hotelling trace. However, we chose Wilks’ lambda for further analysis because it provided a good compromise between type I error rate and power in comparison to the other multivariate test statistics.

Fig 1

Performance of different multivariate statistics.

Example plots show empirical type I error and power of the MANOVA test using four common multivariate test statistics. Type I error rate is shown for the simulation scenario with no treatment effects, equal variance in all groups and data drawn from a multivariate normal distribution. An example of power analysis is shown for a simulation with large treatment effects (Cohen’s d equal to 0.8), equal variance in all groups and data sampled from a multivariate normal distribution. Hotelling: Lawley-Hotelling trace; Pillai: Pillai’s trace; Roy: Roy’s largest characteristic root; Wilks: Wilks’ lambda.

False positive rate

Empirical type I error rates of the methods we evaluated under different simulation scenarios are summarized in Fig 2. Differences between univariate and multivariate methods were negligible under all simulation conditions. Furthermore, all methods managed to remain close to the nominal type I error rate of 5% even in the case of extreme variance heterogeneity (variance ratio between control and treatment group equal to 1:5). Interestingly, Welch’s ANOVA was associated with a slightly higher false positive rate compared to other methods when data were sampled from a multivariate lognormal distribution combined with extreme variance heterogeneity. Furthermore, linear mixed effects models had a slightly higher type I error rate in the case of 5 subjects per group.

Fig 2

Type I error rate of univariate and multivariate techniques under different simulation conditions.

The title of each plot reports the multivariate distribution from which the data were sampled as well as the variance ratio between the simulated control and treatment groups. ANOVA: Analysis of variance; MANOVA: Multivariate analysis of variance; Mixed: Linear mixed effects model; MV: Multivariate; PCA: Principal component analysis.

Empirical power

The results we obtained for empirical power under different simulation conditions are depicted in Figs 3–5. Linear mixed effects models outperformed the remaining methods in the case of variance equality or moderate variance heterogeneity (variance ratio 1:2) with smaller sample sizes of 5 to 20 subjects per group regardless of the effect size we simulated. Welch’s ANOVA was as powerful as regular ANOVA when the variance between the control and treatment groups was equal. Furthermore, Welch’s ANOVA outperformed all other methods when we simulated moderate or small effect sizes combined with extreme variance heterogeneity (ratio of 1:5 between the control and treatment groups) and data coming from multivariate lognormal or gamma distributions. MANOVA tests were slightly more powerful than the two types of ANOVA in the cases of equal variance but still failed to outperform linear mixed effects models under these simulation scenarios. The multivariate strategy of ANOVA tests on PCA scores was universally associated with the lowest rate of rejecting H0. It is also worth mentioning that adequate levels of power of around 80% were achieved in the case of at least 20 measurements per group and large treatment effects (Cohen’s d equal to 0.8, Fig 3). Simulating moderate treatment effects (Cohen’s d equal to 0.5, Fig 4) required a sample size of at least 40 replicates per group in order to achieve levels of power of around 80%. Finally, the rate of rejecting H0 varied between 5% and 25% when we simulated small treatment effects (Cohen’s d equal to 0.2, Fig 5).

Fig 3

Empirical power of univariate and multivariate techniques in case of large treatment effects (Cohen’s d equal to 0.8).

The multivariate distribution from which the data were drawn as well as the variance ratio between simulated control and treatment groups are summarized in the title of each respective plot. ANOVA: Analysis of variance; MANOVA: Multivariate analysis of variance; Mixed: Linear mixed effects model; MV: Multivariate; PCA: Principal component analysis.

Fig 4

Empirical power of univariate and multivariate techniques in case of moderate treatment effects (Cohen’s d equal to 0.5).

Fig 5

Empirical power of univariate and multivariate techniques in case of small treatment effects (Cohen’s d equal to 0.2).

Comparison of ordination techniques for pattern analysis of treatment effects

We investigated if the dimensionality reduction techniques LDA, PCA, RDA, and PLS-DA could be useful for investigating patterns of treatment effects without formal hypothesis testing. In 1000 simulated data sets without treatment effects and 5 measurements per group, we counted how often the control group was separated from treatment groups along the first and second multivariate dimensions (indicated by non-overlapping 95% confidence ellipses). LDA captured a false treatment effect pattern in 387 cases corresponding to a false positive rate of 38.7%. In contrast, the control group was not separated from treatment groups in any of the simulated sets when using PCA, PLS-DA, or RDA for dimensionality reduction. Example plots are shown in Fig 6 (the whole set of plots is available in S2 Appendix). Due to the unacceptably high false positive rate, we did not further consider LDA. Next, we simulated 1000 data sets with huge treatment effects (Cohen’s d equal to 2.0) with 5 measurements per group and investigated how often the control group was separated from treatment groups in reduced multivariate space. PLS-DA managed to capture the true treatment pattern in 13.8% of the cases whereas PCA only separated the control from treatment groups in 7.7% of the simulations. RDA only marginally outperformed PCA and reported a true treatment effect pattern in 9.6% of the cases (the complete simulated set of plots is available in S3 Appendix).

Fig 6

Comparison of ordination techniques for pattern analysis in the case when no treatment effects were simulated.

Plots show results for one out of 1000 simulations with 5 measurements per group drawn from a multivariate normal distribution with equal variance between control and treatment groups. The ordination technique was considered to falsely capture a treatment effect pattern in the data if the 95% confidence ellipse of the control group did not overlap with the confidence ellipses for the treatment groups (dose1 to dose3). LDA: Linear discriminant analysis; PCA: Principal component analysis; PLS-DA: Partial least squares discriminant analysis; RDA: Redundancy analysis.

A practical example of applying ordination techniques and statistical testing methods

In order to give an example of how ordination techniques can be combined with statistical testing methods in practice, we simulated a data set with 5 measurements per group and huge treatment effects for 9 randomly selected variables out of the total 18. The endpoints with simulated treatment effects were 20-point neuroscore on day 1 and day 7, limb placing score on day 1 and day 7, lesion volume on day 1 and day 7, edema volume on day 1 and day 14, and T2 lesion in the contralateral cortex on day 1. The remaining 9 variables were drawn from the same distributions in the control and the 3 treatment groups without simulated treatment effects. In the first step of the analysis, we applied PLS-DA, which was the most sensitive technique in our simulations, to investigate if the control group differed from the treatment groups in reduced multivariate space. We observed that the control group was separated from the treatment groups along the first multivariate axis, which accounted for 36% of the variance (Fig 7). In order to investigate which of the original variables were responsible for group separation, we calculated the correlations of the original variables with the first PLS-DA multivariate dimension (axis 1) along which the control and treatment groups were separated. Correlations with an absolute value below 0.5 were set to 0 in order to filter out unimportant variables. The correlation pattern indicated that all variables with simulated treatment effects along with two additional variables (lesion volume at day 14 and T2 lesion at day 14) contributed to the separation of the control from the treatment groups. Therefore, PLS-DA managed to capture the treatment effect pattern by identifying all original variables with simulated treatment effects as important for group separation in reduced space.

Fig 7

Partial least squares discriminant analysis (PLS-DA) to identify treatment effect patterns.

We simulated a data set with 5 measurements per group and huge treatment effects for 9 randomly selected endpoints out of the 18 variables in the data set. The control group was separated from the treatment groups along the first multivariate dimension in the PLS-DA analysis. We calculated the correlation of the original variables with this dimension to identify which original endpoints explained the multivariate pattern. Correlations with an absolute value below 0.5 were set to 0 in order to filter out unimportant variables. All 9 variables with simulated treatment effects were significantly correlated with the first multivariate axis. Two additional variables without simulated treatment effects (lesion volume at day 14 and T2 lesion at day 14) were also significantly correlated with the first multivariate axis.

Next, we followed up on the multivariate pattern analysis by performing statistical testing with linear mixed effects models for each variable with repeated measures. The interaction term between treatment and time was highly significant for all six endpoints, thereby rejecting H0 of no treatment effects even for T2 lesion, which was the only variable without any simulated treatment effects at any time point. We then performed post-hoc analysis comparing the treatment groups against the control group for each time point separately. Results are shown in Table 1. For the 20-point neuroscore at day 1, only treatment groups 2 and 3 differed significantly from the control group, and no statistically significant difference was detected for the 20-point neuroscore at day 7. Similarly, post-hoc analysis did not detect a treatment effect for any of the groups for lesion volume at day 7 and edema volume at day 14. In contrast, all treatment effects were identified for lesion volume at day 1, edema volume at day 1 and T2 lesion in the contralateral cortex at day 1.

Table 1

Post-hoc analysis following linear mixed effects models for variables with repeated measures.

Variable	Control vs. dose 1	Control vs. dose 2	Control vs. dose 3
20 point neuroscore day 1	0.51	0.0001	0.0027
20 point neuroscore day 7	0.293	0.095	0.074
20 point neuroscore day 14	0.688	0.593	0.354
Limb placing score day 1	1.000	0.483	0.047
Limb placing score day 7	0.298	0.033	0.047
Limb placing score day 14	0.297	0.483	0.383
Lesion volume day 1	<0.0001	<0.0001	<0.0001
Lesion volume day 7	0.119	0.117	0.131
Lesion volume day 14	0.778	0.332	0.487
Edema volume day 1	0.0002	<0.0001	<0.0001
Edema volume day 7	0.266	0.494	0.824
Edema volume day 14	0.129	0.122	0.087
T2 lesion day 1	0.309	0.338	0.453
T2 lesion day 7	0.826	0.627	0.827
T2 lesion day 14	0.203	0.001	0.05
T2 lesion contralateral cortex day 1	0.004	0.0005	<0.0001
T2 lesion contralateral cortex day 7	0.230	0.286	0.316
T2 lesion contralateral cortex day 14	0.828	0.201	0.529

We performed linear mixed effects analysis for each endpoint with repeated measures followed by post-hoc pairwise comparisons between the control and each treatment group for each time point separately. Variables with simulated treatment effects are highlighted with a bold font. The p-values from the post-hoc comparisons are reported in the table. P-values less than 0.05 are highlighted with a bold font.

The difference between the control and treatment groups 2 and 3 for T2 lesion at day 14 was reported as significant even though we did not simulate treatment effects for this variable. Altogether, post-hoc analysis following linear mixed effects models captured most but not all individual differences between the control and treatment groups. In contrast, the multivariate pattern analysis using PLS-DA marked all variables with simulated treatment effects as important for group separation in reduced multivariate space.

Discussion

Using Monte Carlo simulations, we evaluated the performance of a number of univariate and multivariate techniques in an effort to identify the optimal strategy for detecting treatment effects in preclinical neurotrauma studies. Importantly, type I error rate was not drastically inflated beyond the 5% nominal rate for any of the hypothesis testing methods under the simulation scenarios we investigated, even when assumptions of normality and homogeneity of variance were violated. Nevertheless, we only simulated a maximal variance inequality ratio of 1:5 between control and treatment group. Moreover, sample size was always equal. Extreme heterogeneity is more problematic in case of unequal group sizes, especially when the smallest group exhibits the largest variance [38]. In such cases, a variance-stabilizing transformation such as log-transformation of the response variables is advisable. Alternatively, in the univariate case, a non-parametric technique might be used (e.g. Friedman or Kruskal-Wallis test). If MANOVA is performed, a more robust statistic might be chosen. Our results suggest that Pillai’s trace would be the most appropriate under these conditions. In terms of power, taking the repeated measures nature of the data into account proved to be the optimal strategy, as linear mixed effects models outperformed the other methods when variance between groups was equal or when variance heterogeneity was moderate. Linear mixed effects models are a flexible class of statistical methods which allow building models of increasing complexity with different combinations of random intercepts and slopes. In practice, however, it might be challenging to assess the significance of fixed effects in the model based on F-tests as the degrees of freedom might not be correctly estimated. In our current study, we used the Kenward-Roger approximation for determining the degrees of freedom [39]. 
Alternatively, likelihood ratio tests might be used in order to test if including the factor of interest significantly improves the model fit compared to a model without the specific factor. Importantly, this requires refitting the linear mixed effects model using maximum likelihood to estimate parameters as usually these models are calculated using restricted maximum likelihood. When the assumptions of normality and homogeneity of variance were violated, univariate Welch’s ANOVA tests outperformed the remaining methods especially with small effect sizes. Furthermore, the rate of rejecting H0 was equivalent to that of standard ANOVA when data were sampled from a multivariate normal distribution with equal variance between groups. These results suggest that Welch’s ANOVA might be more appropriate for statistical testing of treatment effects than the much more popular standard ANOVA F-test. Additionally, univariate methods offer the advantage of directly investigating differences on endpoints of interest whereas multivariate tests are applied on a linear combination of the original variables. Nevertheless, ignoring the correlation structure of the response variables may result in misleading conclusions. Correlated variables reflect overlapping variance and therefore univariate tests provide little information about the unique contribution of each dependent variable [30]. The issue of correlated outcome measures is addressed by employing multivariate methods. When differences are evaluated between groups which are known a priori, MANOVA is the technique of choice. In our study, MANOVA offered a marginally higher power than univariate ANOVAs when the assumption of variance homogeneity was met. However, a practical issue of this method is that standard software reports four different statistics which do not always provide compatible results. Under all simulation conditions we investigated, Roy’s largest root was associated with an unacceptably high type I error rate. 
This would make interpretation of results with real high-dimensional data sets with few measurements per variable very ambiguous. However, Wilks’ lambda, Lawley-Hotelling trace and Pillai’s trace were robust to false positives. In agreement with previous reports, Pillai’s criterion was the most conservative, which would make it more appropriate when assumptions of MANOVA are violated [40, 41]. Nevertheless, we opted to use Wilks’ lambda for subsequent comparisons between different techniques because it offered similar robustness but slightly increased power. Another trade-off of MANOVA and multivariate techniques in general is the complexity of interpretation. If the omnibus test is significant, a researcher will often want to more precisely identify the variables which are responsible for group separation. Ideally, follow-up tests should retain the multivariate nature of the analysis. Such strategies include descriptive discriminant analysis [30, 42] or Roy-Bargmann stepdown analysis [30, 43]. A crucial factor we did not consider in our study is missing data which cannot be handled by multivariate statistical methods. If the degree of missingness is within a reasonable range (e.g. not more than 10%) and the assumption of missing at random is satisfied, then a multiple imputation technique might be employed to estimate the missing data from the existing measurements. Otherwise, a more flexible data analysis method must be employed such as for instance linear mixed effects models, which are able to handle missing data. Since MANOVA only very marginally outperformed univariate ANOVAs and failed to provide an increase of power compared to linear mixed effects models, we believe that this does not offset the increased complexity and inability to handle missing data. Therefore, our results would suggest that MANOVA tests are not a practical option for formal hypothesis testing in preclinical studies with small sample sizes. 
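To make the relationship between the four MANOVA statistics discussed above more concrete, the following minimal sketch computes them from the eigenvalues of the matrix E⁻¹H (E: error matrix, H: hypothesis matrix). The function name `manova_statistics` and the example eigenvalues are hypothetical and not taken from the study. Note that some software reports Roy's largest root as λ_max / (1 + λ_max) rather than λ_max; the sketch assumes the latter convention.

```python
from math import prod


def manova_statistics(eigenvalues):
    """Return the four classical MANOVA statistics from the eigenvalues of E^-1 H."""
    lam = [float(v) for v in eigenvalues]
    if not lam or any(v < 0 for v in lam):
        raise ValueError("need at least one eigenvalue, all non-negative")
    return {
        # Product of 1 / (1 + lambda_i); small values speak against H0.
        "wilks_lambda": prod(1.0 / (1.0 + v) for v in lam),
        # Sum of lambda_i / (1 + lambda_i).
        "pillai_trace": sum(v / (1.0 + v) for v in lam),
        # Sum of lambda_i, i.e. trace(E^-1 H).
        "lawley_hotelling_trace": sum(lam),
        # Largest eigenvalue only (assumed convention, see lead-in).
        "roy_largest_root": max(lam),
    }


# Illustrative (made-up) eigenvalues, not values from the simulation study.
stats = manova_statistics([1.0, 0.25])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Because Roy's largest root uses only the largest eigenvalue while the other three statistics combine all of them, the four criteria need not lead to the same conclusion for a given data set.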
It is important to note that different methods achieved acceptable levels of power of around 80% only when we simulated large treatment effects with 20 measurements per group or moderate effects with at least 40 replicates per group. This finding highlights a serious issue not only in neurotrauma models but in preclinical research altogether, namely that typical sample sizes in animal studies do not ensure adequate power unless the effect size is large. Accordingly, some authors argue that animal studies should more closely adhere to the standards for study conduct and reporting applicable to controlled clinical trials [1, 44]. In a randomized clinical study, sample size is calculated a priori based on a specific effect size, assumptions about the variance in the response variable, and the desired level of power. In theory, the ARRIVE guidelines which were developed in order to improve the quality of study conduct and reporting of animal trials [45] as well as animal welfare authorities [46] require formal justification for sample size selection. Group size should be appropriate to detect a certain effect with adequate power while simultaneously ensuring that no more animals than necessary are used [46]. In practice, power calculations for preclinical trials are challenging for a number of reasons. For instance, information about the variance in the response variable might not be available a priori, however this issue might be tackled by performing a small scale pilot study. Another problem may be that the estimated effect is small while the variance in the selected endpoint is high, which results in such large group sizes that might not be acceptable for animal welfare regulators. One possible way to address this problem is to identify methods which are associated with higher power in small samples or try to reduce the variability in the response variables by possibly including other covariates in the analysis [47]. 
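As a rough illustration of an a priori sample size calculation, the sketch below uses the standard normal approximation for a two-sided comparison of two independent groups, n = 2(z_{1-α/2} + z_{1-β})² / d² per group. This is a simplification and not the procedure used in our simulations: it assumes two groups, a single endpoint, equal variances and no repeated measures, and the function name `n_per_group` is hypothetical. The exact t-test based calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(cohens_d, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-group comparison."""
    if cohens_d <= 0:
        raise ValueError("effect size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)               # quantile matching the desired power
    return ceil(2.0 * (z_alpha + z_beta) ** 2 / cohens_d ** 2)


# Conventional small, moderate and large standardized effects.
for d in (0.2, 0.5, 0.8):
    print(f"Cohen's d = {d}: about {n_per_group(d)} subjects per group")
```

Even under these simplified assumptions, small and moderate effects call for group sizes well beyond those typical of animal studies, which is in line with the simulation results discussed above.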
A recent development in the effort to increase power of animal studies includes performing systematic reviews and meta-analyses of existing studies [48]. This approach is well established in clinical research and it allows scientists to appraise estimated effect sizes more systematically and put them in the context of existing reports. The majority of preclinical meta-analyses which have been performed in the field of neurotrauma so far are related to experimental stroke (e.g. [49-54]). Preclinical meta-analyses on, for example, spinal cord injury [55, 56] and subarachnoid hemorrhage [57] have also been published. Nevertheless, since a meta-analysis is not always practicable, especially when a novel study is conducted, we investigated whether ordination techniques might be useful to detect treatment effect patterns with small sample sizes. Multivariate techniques classically rely on data sets consisting of more observations than variables, which is not always the case in animal studies, especially in the omics era. Therefore, we first evaluated whether LDA, PCA, PLS-DA, or RDA falsely report non-existent patterns in simulated data sets without treatment effects. With 5 measurements per group and 18 variables, LDA was associated with a false positive rate of 38.7% while PCA, PLS-DA, and RDA did not capture false patterns in the data. The extreme over-fitting we observed for LDA is due to multicollinearity in the data set (see Table 2 of S1 Appendix for the correlation matrix used for simulating multivariate data sets) combined with a small sample size [58]. While this is not necessarily a novel finding, our simulation results highlight the dangers of carelessly applying a dimensionality reduction technique to multivariate data sets with more variables than measurements, which often leads to false inferences. In contrast, PCA is capable of overcoming the “large p, small n” problem by reducing the large number of variables to a few uncorrelated components. 
The method only imposes the constraint that the first component captures the direction of greatest variance in the data hyper-ellipsoid [32] and does not perform regression or classification of data. Therefore, multicollinearity poses no issue. However, group assignment is ignored and so differences between groups do not necessarily become apparent in reduced space. RDA is the supervised version of PCA and it imposes the constraint that the dependent variables in reduced space are linear combinations of the grouping variable. Surprisingly, RDA demonstrated only a slightly increased sensitivity to detect true treatment effect patterns in our simulations compared to PCA. Conversely, PLS-DA clearly outperformed both PCA and RDA. Although PLS-DA uses the quantitative variables to predict group membership similarly to classical LDA, classification is performed after dimensionality reduction [59]. PLS-DA thereby overcomes the problem of multicollinearity and simultaneously tries to maximize group differences, which was the most effective strategy in our simulations. Nevertheless, differences between methods only became apparent when we simulated huge treatment effects (Cohen’s d equal to 2.0). However, in our practical example of combining ordination techniques with statistical testing methods to investigate treatment effects, PLS-DA managed to identify all variables with simulated treatment effects as important for the observed multivariate pattern, whereas follow-up statistical tests did not capture all differences successfully. PLS-DA might therefore be a useful strategy to preselect important endpoints for targeted statistical testing with the goal of reducing the overall number of tests.

Conclusion

Assessing therapeutic success in preclinical neurotrauma studies remains challenging when small samples are combined with small effect sizes. Our simulation study demonstrated that linear mixed effects models offer a slightly increased power in case of equal variance whereas Welch’s ANOVA should be used when the assumption of homogeneity of variance is violated. Additionally, PLS-DA offers a higher sensitivity to detect treatment effect patterns than PCA and RDA, whereas classical LDA leads to overfitting and false inferences in multivariate data sets with few measurements per group. Although we based our simulation on a real neurotrauma preclinical study, our findings might be more generally applicable to multivariate data sets with a similar correlation structure, as we applied standardized measures of effect size which are not restricted to a specific endpoint or type of study. Ultimately, translational success of animal trials in neurotrauma would greatly benefit from appropriate sample size calculation prior to conduct of the study. When this is not feasible, it is advantageous to re-evaluate estimates of treatment effect with combined evidence from existing studies (if available) by performing systematic reviews and meta-analyses.
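For readers unfamiliar with Welch's ANOVA, the following minimal sketch shows how its test statistic and degrees of freedom are obtained for a single endpoint. It is an illustration only and not the code used in this study: the function name `welch_anova` and the example values are hypothetical, and the p-value (the upper tail of an F distribution with the returned degrees of freedom) is left to a statistics package.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) of Welch's heteroscedastic one-way ANOVA."""
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least two groups with two observations each")
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Precision weights n_i / s_i^2; a group with zero variance is not supported.
    w = [n_i / variance(g) for n_i, g in zip(n, groups)]
    w_total = sum(w)
    # Variance-weighted grand mean.
    grand = sum(w_i * m_i for w_i, m_i in zip(w, m)) / w_total
    between = sum(w_i * (m_i - grand) ** 2 for w_i, m_i in zip(w, m)) / (k - 1)
    lam = sum((1 - w_i / w_total) ** 2 / (n_i - 1) for w_i, n_i in zip(w, n))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)  # generally not an integer
    return f_stat, df1, df2


# Made-up endpoint values for a control group and three dose groups.
control = [10.1, 9.8, 10.4, 10.0, 9.7]
dose1 = [10.9, 9.2, 11.5, 10.3, 8.8]
dose2 = [11.8, 9.9, 13.1, 10.6, 12.4]
dose3 = [12.5, 9.1, 14.8, 11.0, 13.6]
f_stat, df1, df2 = welch_anova([control, dose1, dose2, dose3])
print(f"F = {f_stat:.2f}, df1 = {df1}, df2 = {df2:.2f}")
```

Each group mean is weighted by the inverse of its own estimated variance, so a highly variable treatment group does not dominate the test; with two groups the statistic reduces to the square of Welch's t statistic.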

The file contains the mean and variance vector of the simulated control group and the correlation matrix used to sample data from multivariate distributions under different simulation scenarios.

Figs 1–4 show comparisons of type I error rate and empirical power of the four different multivariate statistics used to evaluate the significance of MANOVA tests. (PDF)

Comparison of ordination techniques to detect treatment effect patterns when no treatment effects were simulated.

The file contains the results from 1000 simulated data sets without treatment effects, 5 measurements per group with data obtained from a multivariate normal distribution with equal variance in all groups. LDA, PCA, RDA, or PLS-DA were considered to falsely capture a non-existing treatment effect pattern if the 95% confidence ellipse around the control group did not overlap with the confidence ellipses of treatment groups (dose1 to dose3). (PDF)

Comparison of ordination techniques to detect treatment effect patterns with huge simulated treatment effects (Cohen’s d equal to 2.0).

The file contains results from 1000 simulated data sets with 5 measurements per group and data obtained from a multivariate normal distribution with equal variance in all groups. PCA, RDA, or PLS-DA were considered to correctly capture a treatment effect pattern if the 95% confidence ellipse around the control group did not overlap with the confidence ellipses of the treatment groups (dose1 to dose3). (PDF)

11 Feb 2020

PONE-D-19-16128

Applying univariate vs. multivariate statistics to investigate therapeutic efficacy in controlled preclinical neurotrauma trials: A Monte Carlo simulation study

PLOS ONE

Dear Dr Gerber,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

In particular, one of the reviewers brought up substantive objections to the approach outlined in your work; it would be particularly helpful for you to address those concerns directly. Furthermore, other reviewers asked to clarify point of methodology in the abstract and main text of the manuscript. Please be sure to address those as well.

We would appreciate receiving your revised manuscript by Mar 27 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. 
For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'. Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Marco Bonizzoni, Ph.D. Academic Editor PLOS ONE Journal requirements: When submitting your revision, we need you to address these additional requirements: 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Thank you for stating the following in the Acknowledgments Section of your manuscript: 'The work of HT was funded by Fresenius Kabi Deutschland GmbH. The work of ESW was funded by the Center for Computational Sciences in Mainz (CSM). The work of SG was partly supported by the CRC 1193.' We note that you have provided funding information that is not currently declared in your Funding Statement. 
However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." Additionally, because some of your funding information pertains to commercial funding, we ask you to provide an updated Competing Interests statement, declaring all sources of commercial funding. In your Competing Interests statement, please confirm that your commercial funding does not alter your adherence to PLOS ONE Editorial policies and criteria by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests. If this statement is not true and your adherence to PLOS policies on sharing data and materials is altered, please explain how. Please include the updated Competing Interests Statement and Funding Statement in your cover letter. We will change the online submission form on your behalf. 3. Thank you for providing the following Funding Statement: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. 
If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study in the Author Contributions section of the online submission form. Please make any necessary amendments directly within this section of the online submission form. Please also update your Funding Statement to include the following statement: “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.” If the funding organization did have an additional role, please state and explain that role within your Funding Statement. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc. Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. 
Please note that we cannot proceed with consideration of your article until this information has been declared. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: No Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: No Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Thank you for inviting me to review the manuscript PONE-D-19-1628 entitled “Applying univariate vs. 
multivariate statistics to investigate the therapeutic efficacy in controlled preclinical neurotrauma trials: A Monte Carlo simulation study”. Motivated by a clinical trial, the manuscript studied the empirical power of different statistics analyzing the trial data with multiple correlated endpoints with repeated measures. In general ,the manuscript was well written and has many merits: addressing a real clinical issue, conducting a simulation study, providing good ground for manipulated factors, visual examination of treatment effect, clear description of software used providing reproducibility options, good definition of the evaluating criteria of assessing the models and statistics. There are some improvements needed before it’s ready for publication. Abstract It is not clear what “acceptable level” of power means. What does “20 measurements per group” mean? Does it mean 20 subjects per group? Methods In general the simulation scenario needs to be clarified: did the author generated 4 groups (1 control, 3 treatments of different doses) and 7 endpoints each with 3 times of measurements (page 5)? Or did the authors generated 2 groups of 1 control and 1 treatment as Hotelling’s T square was used in the MANOVA analysis (page 10). If the simulation is in 4 groups, how is effect size Cohen’s d defined? Is it defined as difference between each two groups? In general, how the MANOVA multivariate responses are defined is not clear: is it the different time points of a specific endpoints that are treated are multivariate (page 7)? Or is it the 7 different endpoints that are treated as multivariate for a specific time? From Appendix 1, it looks like both the endpoints and time points are used as different response, altogether 19 columns. Please clarify in the method section. 
Simulation procedure: the way simulation was done is a good representation of the clinical trial by bootstrapping method, yet it can also make the results of the study biased to a specific trial and limit its generalizability. Simulation factors: How was the simulation scenarios of 24 defined? There were 4 levels in n, 4 levels in ES (0, .2, .5, .8), 3 levels in variance. If effect size = 0 is not treated as a simulation condition, details need to be elaborated. What about the distribution of dependent variables that also include log transformation? Multivariate dimensionality reduction techniques for pattern analysis: please explain why is huge treatment effects (Cohen’s d=2.0) chosen as an example. Reviewer #2: This paper seeks to encourage potentially more appropriate analysis of data from preclinical experiments involving multiple outcomes and multiple experimental groups. The main hypothesis is that a multivariate consideration of the outcomes rather than multiple univariate tests may be more powerful and the goal is to identify which multivariate method might be most useful via a simulation study. The goal of the paper is laudable, but there are several issues with the approach and holes that limit the validity of the conclusions. The first set of issues is related to the comparison of univariate to multivariate tests with respect to type I error. The way empirical type I error is calculated is ill-conceived. As the authors state, the univariate tests will maintain close to nominal levels of type I error on a test-by-test basis. In practice, the concern would be for the case when there really is no effect, but a few outcomes have (unadjusted) p<0.05. That should be the comparison. How often does a set of univariate tests give a "wrong result" in terms of concluding the treatment has an effect based on one or more significant p-values (if that is the rule for finding a significant difference). 
But if one were to accept 1 or more significant p-values among any of the multiple tests as indicating difference, standard practice would be to adjust the p-values using a Bonferroni adjustment or some other method of controlling the type I error rate. There is a similar issue with how power is calculated. I am also confused by the combination of repeated measurement of multiple outcomes into a single vector without taking any of that information into account in most of the multivariate analyses. In this setting, it would be (somewhat) uncommon to do univariate on all items separately. Mixed effects regression or repeated measures ANOVA would be the choice to make, and mixed effects models for multivariate repeated measures do exist. It would have been helpful to consider these alternatives. It would seem that ignoring knowledge about the data structure that comes from the experimental design might also severely hamper the performance of the PCA ANOVA and dimensionality reduction techniques. Another issue with multivariate methods that is not addressed is missing data. MANOVA cannot handle data that are missing at random while mixed effects models can. Although not a focus of this analysis, the limitations of MANOVA in this regard would suggest opting for a more flexible method like mixed effects models for comparative purposes. Perhaps the most important question is whether multivariate techniques, even if they improve power in some modest way, are of value in the preclinical setting since, as the authors state in the introduction, they have "increased complexity of analysis and interpretation of results." If separate modeling of each outcome provides an accurate representation of the effect of the treatment on that outcome, does a multivariate p-value or a data reduction technique help if we can't easily interpret the effect? 
Perhaps coupling a data analysis example to the simulations where all of the methods were applied to a real data set would help to clarify the analytic methods that were actually applied and the issues in interpretation that come with the methods. Reviewer #3: The paper evaluates the performance of univariate ANOVA and Welch’s ANOVA tests versus multivariate techniques based on the simulation study, taking into account sample size/effect size, normality and homogeneity of variance. The idea makes sense intuitively (according to the statistical textbook/theory) and the result may be helpful for some researchers in application. However, the methodology is not novel and the broadness of the application may be not enough. It may be helpful to medical researchers. I have some concerns and comments as follows. (1) I assume that this is more like a statistical research paper, not medical research paper (2) It is not clear why does the title of this paper include “in controlled preclinical neurotrauma trials”? It seems that the result of the simulation study can be applied to different trials (or clinical trials), not just neurotrauma trials only. (3) The results from the simulation study show that Welch’s ANOVA is as powerful as classical ANOVA tests with variance homogeneity and outperformed the remaining methods when this assumption was violated. However, most animal data are much more homogeneous (less variation), in comparison with the clinical data (human being). That is to say, the result from the simulation study may be helpful to a small clinical study (bigger variation), not just preclinical trials (smaller variation). (4) In simulation factors section, a sample size of 5 – 20 may be too small. 
It will be interesting to see more scenarios with a ranged from 5 – 50 (say) to benefit more people (similar to my comment (2)) (5) In simulation factors section, the correlation in multivariate normal distribution of dependent variables is missing (which is important) (6) In the simulation study, you may generate the data from other distributions (not normal/ log-normal distribution. This can be another factor in the simulation study. For example, a gamma distribution or Weibull distribution. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step. 
6 Mar 2020 Dear Editor, Dear Reviewers, we would like to take the opportunity to thank you for giving us the chance to revise our Manuscript and for your very valuable comments giving us the chance to improve our work. We have addressed all the reviewers’ comments. A point-by-point response to the individual reviewers’ questions is provided below. Please find the comments also in our "Letter of response" in a clearer format and with highlighted paragraphs. Sincerely, Susanne Gerber, on behalf of the authors. Reviewer #1 Abstract It is not clear what “acceptable level” of power means. Usually, the acceptable level of power (also the value used for sample size estimations) is 80%. We have adapted the text to make this understandable. What does “20 measurements per group” mean? Does it mean 20 subjects per group? In the course of the manuscript, we have used the terms “measurements per group”, “subjects per group” and “replicates per group” interchangeably. We have included a statement in the methods section to make this more clear. Methods In general the simulation scenario needs to be clarified: did the author generated 4 groups (1 control, 3 treatments of different doses) and 7 endpoints each with 3 times of measurements (page 5)? Or did the authors generated 2 groups of 1 control and 1 treatment as Hotelling’s T square was used in the MANOVA analysis (page 10). If the simulation is in 4 groups, how is effect size Cohen’s d defined? Is it defined as difference between each two groups? We apologize for these ambiguities. We simulated 4 groups in each case, one of these groups was considered to be a control group and the remaining 3 groups were considered to be treatment groups. The effect size was always considered as the difference between the control group and each of the treatment groups. In the statistical analysis, all 4 groups were considered simultaneously. 
In the MANOVA analysis, we actually used the Lawley-Hotelling statistic which is a generalization of the Hotelling’s T square statistic and is calculated as trace(E⁻¹H) where E denotes the error matrix and H denotes the hypothesis matrix. Alternatively, the Lawley-Hotelling statistic is also equal to the sum of the eigenvalues of the E⁻¹H matrix. We have updated the name of this test statistic to Lawley-Hotelling trace in the revised version of the manuscript to avoid a possible confusion. In general, how the MANOVA multivariate responses are defined is not clear: is it the different time points of a specific endpoints that are treated are multivariate (page 7)? Or is it the 7 different endpoints that are treated as multivariate for a specific time? From Appendix 1, it looks like both the endpoints and time points are used as different response, altogether 19 columns. Please clarify in the method section. Our data set included 6 variables, each of these 6 variables was measured at three different time points. A seventh variable was measured only once, but we have excluded it from the revised manuscript as we investigated additional methods for repeated measures. A separate MANOVA test was performed for each endpoint. The repeated measures at the three time points served as dependent variables for each MANOVA test. Thus, we performed 6 MANOVA tests with three dependent variables each. In the univariate ANOVA tests, each repeated measure was considered as a separate variable. We have tried to make this more clear in the revised version of the manuscript. Simulation procedure: the way simulation was done is a good representation of the clinical trial by bootstrapping method, yet it can also make the results of the study biased to a specific trial and limit its generalizability. We thank the reviewer for this important comment. We agree that our approach might limit the generalizability of the results. 
Since there are countless options for the number of variables, the mean vector and the correlation structure of a multivariate data set, we decided to base the simulation procedure on a real study in order to have simulated data that are as realistic as possible. In order to increase the generalizability, we did not directly use the mean vector and correlation matrix of the original data but used the bootstrap procedure, which should at least give us more population-specific and not just sample-specific estimates of the parameters used for the simulations.

Simulation factors: How were the 24 simulation scenarios defined? There were 4 levels in n, 4 levels in ES (0, .2, .5, .8), 3 levels in variance. If effect size = 0 is not treated as a simulation condition, details need to be elaborated. What about the distribution of dependent variables that also include log transformation?

We thank the reviewer for pointing out this inconsistency. When calculating the number of simulation scenarios we mistakenly overlooked the factor of sample size, which resulted in this incorrect number. In the revised version of the manuscript, we have performed additional simulations with data following a multivariate gamma distribution and sample sizes of 30, 40 and 50 subjects per group. Therefore, we now actually have 252 scenarios (with 3 different distributions, 4 different effect sizes, 3 different variance ratios and 7 different sample sizes per group).

Multivariate dimensionality reduction techniques for pattern analysis: please explain why a huge treatment effect (Cohen’s d = 2.0) was chosen as an example.

In this example, we simulated 5 subjects per group, and Cohen’s d = 2.0 was the lowest value of the effect size for which we observed a difference between the ordination methods, indicating that smaller effect sizes cannot be detected given a sample size of 5 subjects per group. We have included this in the revised manuscript.
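The scenario count given in the response on simulation factors (3 distributions, 4 effect sizes, 3 variance ratios, 7 sample sizes) can be checked by enumerating a full factorial grid. The sketch below is illustrative only. The effect sizes and the added sample sizes of 30, 40 and 50 come from the correspondence; the distribution names, the variance-ratio labels and the four smaller sample sizes are placeholders assumed for this example, since the letter does not list them.

```python
from itertools import product

# Effect sizes are listed in the reviewer's comment; 0 is the null scenario.
effect_sizes = [0.0, 0.2, 0.5, 0.8]

# Assumed names for the "3 different distributions" mentioned in the response.
distributions = ["normal", "log-normal", "gamma"]

# Hypothetical labels: the letter states only that there are 3 variance ratios.
variance_ratios = ["ratio_A", "ratio_B", "ratio_C"]

# 30, 40 and 50 are stated in the response; 5, 10, 15 and 20 are assumed
# values for the four original levels in the 5 to 20 range.
sample_sizes = [5, 10, 15, 20, 30, 40, 50]

scenarios = [
    {"distribution": dist, "effect_size": es, "variance_ratio": vr, "n_per_group": n}
    for dist, es, vr, n in product(distributions, effect_sizes, variance_ratios, sample_sizes)
]

print(len(scenarios))  # 3 * 4 * 3 * 7 = 252

# Scenarios without a simulated effect are the ones used for type I error.
null_scenarios = [s for s in scenarios if s["effect_size"] == 0.0]
print(len(null_scenarios))  # 3 * 3 * 7 = 63
```

Keeping the grid as a list of dictionaries makes it straightforward to loop over scenarios and to store results alongside the factor levels that produced them.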
Reviewer #2

The first set of issues is related to the comparison of univariate to multivariate tests with respect to type I error. The way empirical type I error is calculated is ill-conceived. As the authors state, the univariate tests will maintain close to nominal levels of type I error on a test-by-test basis. In practice, the concern would be for the case when there really is no effect, but a few outcomes have (unadjusted) p<0.05. That should be the comparison. How often does a set of univariate tests give a "wrong result" in terms of concluding the treatment has an effect based on one or more significant p-values (if that is the rule for finding a significant difference). But if one were to accept 1 or more significant p-values among any of the multiple tests as indicating difference, standard practice would be to adjust the p-values using a Bonferroni adjustment or some other method of controlling the type I error rate. There is a similar issue with how power is calculated.

We believe we used the standard approach to calculate the empirical type I error rate and power of the different tests. Furthermore, we believe our strategy complies with what the reviewer described as to how type I error rate and power should be estimated. We determined type I error rate when no treatment effects were simulated as in this case no significant differences should be detected. For the univariate tests, for example, we performed 18 ANOVA tests in each simulation round (repeated 1000 times under each simulated scenario). Each time, we determined the fraction of these 18 tests that were significant and finally reported the average fraction over the 1000 simulations as the final estimate of type I error rate. This is mathematically equivalent to counting the number of significant tests and dividing that by the number of total tests performed.
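The equivalence stated in this response, and the contrast with the family-wise rate raised in the reviewer's comment, can be illustrated with a small sketch. It is not the authors' code: instead of running ANOVA tests on simulated data, it assumes that each null p-value is an independent Uniform(0, 1) draw, which ignores the correlation between endpoints in the actual study. The variable names and the seed are arbitrary.

```python
import random

ALPHA = 0.05
N_TESTS = 18      # univariate tests per simulation round, as stated in the response
N_ROUNDS = 1000   # simulation rounds per scenario, as stated in the response

rng = random.Random(42)

# Assumption: under the null, p-values are independent Uniform(0, 1) draws.
p_values = [[rng.random() for _ in range(N_TESTS)] for _ in range(N_ROUNDS)]

# Per-test rate as the average per-round fraction of significant tests.
fractions = [sum(p < ALPHA for p in round_p) / N_TESTS for round_p in p_values]
mean_fraction = sum(fractions) / N_ROUNDS

# The same rate as total significant tests divided by total tests performed.
n_significant = sum(p < ALPHA for round_p in p_values for p in round_p)
pooled_rate = n_significant / (N_TESTS * N_ROUNDS)

# Family-wise rate: share of rounds with at least one significant test,
# without adjustment and with a Bonferroni-adjusted threshold.
fwer_unadjusted = sum(min(round_p) < ALPHA for round_p in p_values) / N_ROUNDS
fwer_bonferroni = sum(min(round_p) < ALPHA / N_TESTS for round_p in p_values) / N_ROUNDS

print(round(mean_fraction, 4), round(pooled_rate, 4))
print(round(fwer_unadjusted, 3), round(fwer_bonferroni, 3))
```

Because every round contains the same number of tests, the two per-test computations agree up to floating-point rounding. Under the independence assumption the unadjusted family-wise rate is expected to be close to 1 − 0.95^18 ≈ 0.60, whereas the per-test rate stays near 0.05; with correlated endpoints the family-wise figures would differ.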
For MANOVA tests, for example, we performed 6 MANOVAs each time and determined the average fraction of these MANOVA tests that were significant over the 1000 simulations. We believe that normalizing the type I error rate and power to the number of tests performed is the only objective way to compare univariate and multivariate methods. Otherwise, the difference in type I error rate or power would be attributable to the difference in the number of tests performed. In practice, a p-value adjustment method would indeed be applied to control the family-wise error rate or the false discovery rate. In our simulations, we did not apply a p-value adjustment because we knew the ground truth. Therefore, each time a p-value was below 0.05, we knew whether this was correct depending on the treatment effect we simulated. We hope our strategy is now more understandable and acceptable to the reviewer.

I am also confused by the combination of repeated measurement of multiple outcomes into a single vector without taking any of that information into account in most of the multivariate analyses. In this setting, it would be (somewhat) uncommon to do univariate on all items separately. Mixed effects regression or repeated measures ANOVA would be the choice to make, and mixed effects models for multivariate repeated measures do exist. It would have been helpful to consider these alternatives. It would seem that ignoring knowledge about the data structure that comes from the experimental design might also severely hamper the performance of the PCA ANOVA and dimensionality reduction techniques.

We thank the reviewer for this important comment. We believe that it is not that uncommon in the field of animal studies to perform separate ANOVA tests for each time point for repeated measures variables.
Furthermore, while the endpoints in the original study on which we based our simulations were measured repeatedly, in the context of our simulations this more generally translates to endpoints which follow a certain correlation structure. Thus, the endpoints which were originally measured repeatedly correspond to variables which are more strongly correlated among each other than with other variables. For this reason, we believe it is a legitimate approach to analyze them using separate ANOVA tests or MANOVA tests in our simulations. Nevertheless, in practice, it does seem imprudent to ignore the repeated measures nature of the data. Therefore, we have also included linear mixed effects analysis of repeated measures in the revised version of the manuscript. Since the original data included one endpoint measured only once, we have excluded it from the updated analysis. Regarding the dimensionality reduction techniques, we believe that their performance is not severely limited by repeated measures data. The assumption is that repeated measures are correlated with each other, and since the dimensionality reduction techniques are calculated based on correlation matrices, this information is implicitly taken into account. For instance, variables with repeated measures often load on the same component in the ordination procedure.

Another issue with multivariate methods that is not addressed is missing data. MANOVA cannot handle data that are missing at random while mixed effects models can. Although not a focus of this analysis, the limitations of MANOVA in this regard would suggest opting for a more flexible method like mixed effects models for comparative purposes.

We agree with the reviewer that missing data might pose a significant limitation for multivariate techniques. If the level of missingness is not great (e.g. less than 10%) and the missing at random condition is satisfied, then these missing values might be imputed and a multivariate technique still applied.
If not, a more flexible method such as linear mixed effects models would be the natural choice. While a systematic investigation of different degrees of missingness and imputation techniques is beyond the scope of our current study, we have included these considerations in the discussion of the revised manuscript.

Perhaps the most important question is whether multivariate techniques, even if they improve power in some modest way, are of value in the preclinical setting since, as the authors state in the introduction, they have "increased complexity of analysis and interpretation of results." If separate modeling of each outcome provides an accurate representation of the effect of the treatment on that outcome, does a multivariate p-value or a data reduction technique help if we can't easily interpret the effect? Perhaps coupling a data analysis example to the simulations where all of the methods were applied to a real data set would help to clarify the analytic methods that were actually applied and the issues in interpretation that come with the methods.

We agree with the reviewer that multivariate techniques such as MANOVA do not offer a practical benefit in preclinical studies for formal statistical testing, as our results indicated that the gain in power compared to ANOVA tests was not nearly sufficient to justify the complexity of interpretation. Furthermore, in the updated simulations, linear mixed effects models outperformed the remaining methods with repeated measures data. We have updated the discussion and conclusion to reflect these findings. However, we believe that dimensionality reduction techniques are useful beyond formal hypothesis testing for data exploration purposes. We have tried to demonstrate this by including a practical data analysis example with one simulated data set in the revised version of the manuscript (Fig. 7 and Table 1).
When we simulated treatment effects on only half of the variables, PLS-DA captured the multivariate pattern in the data and managed to identify the variables with simulated treatment effects as important for group separation in reduced multivariate space. We hope this applied example demonstrates the usefulness of these methods for exploratory data analysis, whereas we suggest that linear mixed effects models or ANOVA tests are more appropriate for formal hypothesis testing.

Reviewer #3

(1) I assume that this is more like a statistical research paper, not a medical research paper.

Our goal in the current study was to systematically compare the performance of a number of univariate and multivariate techniques by manipulating factors justified by the statistical assumptions of the different tests. As such, our investigations are more statistical in nature. While the practical implications of our results could hopefully be useful to the medical research community when confronted with similar data, we believe that applying especially the multivariate techniques requires a solid statistical background which pure experimentalists might lack. Thus, our results would hopefully be helpful to biostatisticians and data analysts who also have a good understanding of the theoretical background of the statistical methods.

(2) It is not clear why the title of this paper includes “in controlled preclinical neurotrauma trials”. It seems that the result of the simulation study can be applied to different trials (or clinical trials), not just neurotrauma trials.

We thank the reviewer for acknowledging the potential of our study to have a broader application than simply preclinical neurotrauma studies. We chose to be more cautious and limit ourselves to this type of study as we based the simulations on data from a real neurotrauma study. However, since we used standardized measures for treatment effects, our findings could theoretically have broader applications.
We have discussed this in the revised version of the manuscript and also slightly changed the title.

(3) The results from the simulation study show that Welch’s ANOVA is as powerful as classical ANOVA tests with variance homogeneity and outperformed the remaining methods when this assumption was violated. However, most animal data are much more homogeneous (less variation) in comparison with clinical data (human beings). That is to say, the result from the simulation study may be helpful to a small clinical study (bigger variation), not just preclinical trials (smaller variation).

We agree with the reviewer that our study results might be useful to a small clinical trial where data are usually more heterogeneous than animal data. Furthermore, we have implemented the reviewer’s suggestion to simulate bigger group sizes (e.g. 30, 40 and 50 replicates per group) and we have also included simulations with data coming from a gamma distribution, thereby hopefully increasing the impact of our findings.

(4) In the simulation factors section, a sample size of 5 to 20 may be too small. It would be interesting to see more scenarios with sample sizes ranging from, say, 5 to 50 to benefit more people (similar to my comment (2)).

As the reviewer helpfully suggested, we have extended the simulation procedure in the revised manuscript to also include simulations with 30, 40 and 50 subjects per group in order to increase the impact of our research and hopefully provide beneficial results to a broader circle of researchers.

(5) In the simulation factors section, the correlation in the multivariate normal distribution of dependent variables is missing (which is important).

We thank the reviewer for this observation. We have included a reference to where the correlation matrix of the dependent variables can be found in the simulation factors section (S1 Appendix Table 2).

(6) In the simulation study, you may generate the data from other distributions (not normal/log-normal distribution).
This can be another factor in the simulation study, for example a gamma distribution or a Weibull distribution.

Following this helpful suggestion, we have also included simulations with data coming from a multivariate gamma distribution in the revised version of the manuscript.

Submitted filename: Response letter_Revision.docx

10 Mar 2020

Applying univariate vs. multivariate statistics to investigate therapeutic efficacy in (pre)clinical trials: A Monte Carlo simulation study on the example of a controlled preclinical neurotrauma trial.

PONE-D-19-16128R1

Dear Dr. Gerber,

After evaluation of the revised version of your manuscript you recently submitted, we are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance.
Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,
Marco Bonizzoni, Ph.D.
Academic Editor
PLOS ONE

12 Mar 2020

PONE-D-19-16128R1

Applying univariate vs. multivariate statistics to investigate therapeutic efficacy in (pre)clinical trials: A Monte Carlo simulation study on the example of a controlled preclinical neurotrauma trial.

Dear Dr. Gerber:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Marco Bonizzoni
Academic Editor
PLOS ONE

40 in total

1. Docosahexaenoic acid, but not eicosapentaenoic acid, reduces the early inflammatory response following compression spinal cord injury in the rat.

Authors: Jodie C E Hall; John V Priestley; V Hugh Perry; Adina T Michael-Titus
Journal: J Neurochem Date: 2012-04-12 Impact factor: 5.372

2. Comparative analysis of lesion development and intraspinal inflammation in four strains of mice following spinal contusion injury.

Authors: Kristina A Kigerl; Violeta M McGaughy; Phillip G Popovich
Journal: J Comp Neurol Date: 2006-02-01 Impact factor: 3.215

3. Ethics and animal numbers: informal analyses, uncertain sample sizes, inefficient replications, and type I errors.

Authors: Douglas A Fitts
Journal: J Am Assoc Lab Anim Sci Date: 2011-07 Impact factor: 1.232

4. Small sample inference for fixed effects from restricted maximum likelihood.

Authors: M G Kenward; J H Roger
Journal: Biometrics Date: 1997-09 Impact factor: 2.571

5. Development of a database for translational spinal cord injury research.

Authors: Jessica L Nielson; Cristian F Guandique; Aiwen W Liu; Darlene A Burke; A Todd Lash; Rod Moseanko; Stephanie Hawbecker; Sarah C Strand; Sharon Zdunowski; Karen-Amanda Irvine; John H Brock; Yvette S Nout-Lomas; John C Gensel; Kim D Anderson; Mark R Segal; Ephron S Rosenzweig; David S K Magnuson; Scott R Whittemore; Dana M McTigue; Phillip G Popovich; Alexander G Rabchevsky; Stephen W Scheff; Oswald Steward; Grégoire Courtine; V Reggie Edgerton; Mark H Tuszynski; Michael S Beattie; Jacqueline C Bresnahan; Adam R Ferguson
Journal: J Neurotrauma Date: 2014-07-31 Impact factor: 5.269