
Analysis of Statistical Methods Currently used in Toxicology Journals.

Jihye Na¹, Hyeri Yang¹, SeungJin Bae¹, Kyung-Min Lim¹.

Abstract

Statistical methods are frequently used in toxicology, yet it is not clear whether the methods employed are used consistently and on sound statistical grounds. The purpose of this paper is to describe the statistical methods used in top toxicology journals. More specifically, we sampled 30 papers published in 2014 from Toxicology and Applied Pharmacology, Archives of Toxicology, and Toxicological Sciences and described the methodologies used to provide descriptive and inferential statistics. One hundred thirteen endpoints were observed in those 30 papers, and most studies had a sample size of less than 10, with the median and the mode being 6 and 3 & 6, respectively. The mean (105/113, 93%) was predominantly used to measure central tendency, and the standard error of the mean (64/113, 57%) and standard deviation (39/113, 34%) were used to measure dispersion, while few studies provided justification for the methods selected. Inferential statistics were frequently conducted (93/113, 82%), with one-way ANOVA being the most popular (52/93, 56%), yet few studies conducted either a normality or an equal variance test. These results suggest that more consistent and appropriate use of statistical methods is necessary, which may enhance the role of toxicology in public health.


Keywords: Biostatistics; Descriptive statistics; Inferential statistics; Standard deviation; Standard error of mean; Toxicology

Year: 2014 PMID: 25343012 PMCID: PMC4206745 DOI: 10.5487/TR.2014.30.3.185

Source DB: PubMed Journal: Toxicol Res ISSN: 1976-8257

INTRODUCTION

Inappropriate use of statistical methods is frequently witnessed in clinical and biology journals (1-4). This issue is of critical importance in toxicology, since inferences drawn from statistics are frequently cited when enacting regulations, establishing guidance, or determining drug safety. However, in many papers published in toxicology journals, sample size and statistical methods have been selected without proper justification. In clinical studies, some guidelines regarding sample size have been suggested (either 25 or 30 as the cut-off for a large sample (5,6)), yet to the best of our knowledge no clear guidance has been provided in the area of experimental statistics. Admittedly, unlike clinical studies, where participants vary greatly, the baseline characteristics of experimental animals (inbred or outbred) or cells (cultured or monoclonal) are tightly controlled and homogeneous; nevertheless, a small sample size still raises concerns when parametric methods are employed for inferential statistics to determine group differences. Data are usually described by measuring their central tendency and dispersion. Although the mean (the average of the values), the median (the value located in the middle of the ordered list of values), and the mode (the most frequently occurring value) are all valid measures of central tendency, one becomes more appropriate than the others depending on the skewness of the distribution. The mean, the most popular measure of central tendency, is especially vulnerable to outliers, so the median may be a better measure of central tendency when outliers are present. In describing dispersion, standard deviation (SD) and standard error of the mean (SEM) are often confused (2-4), even though their purposes are clearly different. SD is a measure of the variability of individual values, whereas SEM represents the precision of the sample mean.
SD should be used to describe the variability of a parameter (3), whereas SEM should be used for hypothesis testing on the sample mean. However, SEM decreases as sample size increases and is usually smaller than SD, which makes the data appear more reproducible and consistent. Thus, using SEM could understate the variability of an individual parameter (7). When inferential statistics are conducted, a null hypothesis is tested and is either rejected or not rejected based on the test statistic. If the assumption of normality is not met (for example, the data are severely skewed or the sample size is small), then the mean and SD do not properly represent the central tendency or the dispersion of the data (6). Hence, a non-parametric approach, which makes fewer assumptions although it is less efficient, should be considered. However, normality tests are often not conducted and parametric approaches are used without due attention. Here, we sampled 30 papers published in 2014 in three major toxicology journals that assessed group differences with statistics. The characteristics of the data presented were analyzed, categorized, and examined. These 30 papers contained more than 110 endpoints or outcome measures, and we reviewed the sample size, the presentation of the data (SD or SEM), and the inferential statistical methods employed. We also discuss the rationale for selecting statistical methods and their appropriateness, to give insight into the proper selection of statistical methods.
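The distinction between SD and SEM drawn above can be illustrated with a short sketch. It assumes the usual definition SEM = SD / sqrt(n), with SD computed as the sample standard deviation; the helper name `sd_and_sem` and the measurement values are hypothetical and are not taken from any of the reviewed papers.

```python
import math
import statistics


def sd_and_sem(values):
    """Return (sample SD, SEM) for a list of numbers, with SEM = SD / sqrt(n)."""
    n = len(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    sem = sd / math.sqrt(n)        # precision of the sample mean
    return sd, sem


# Hypothetical measurements (illustration only).
small = [4.1, 5.0, 5.9, 4.6, 5.4, 5.0]
# Same values repeated four times: similar spread, four times the sample size.
large = small * 4

sd_small, sem_small = sd_and_sem(small)
sd_large, sem_large = sd_and_sem(large)

print(f"n={len(small)}: SD={sd_small:.3f}, SEM={sem_small:.3f}")
print(f"n={len(large)}: SD={sd_large:.3f}, SEM={sem_large:.3f}")
```

In this sketch the SD stays of similar size when the sample grows, while the SEM shrinks, which is why error bars based on SEM can make the same data look less variable.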

MATERIALS AND METHODS

Thirty articles that examined group differences resulting from toxicant exposure and were published in 2014 in Toxicology and Applied Pharmacology (11 papers (8-17)), Archives of Toxicology (8 papers (18-25)), and Toxicological Sciences (11 papers (26-36)) were selected. These three journals ranked highly in the Toxicology category of the Thomson Reuters JCR 2012. In selecting the 30 articles, diversity in statistical methods was considered, and papers with an unclear explanation of their statistics were excluded. From these 30 papers, outcome measures (endpoints) were extracted and reviewed for data type, sample size, and the methods of descriptive and inferential statistics. The data were classified as numeric vs. categorical, based on the outcome measures shown in the paper. Numeric data were categorized as continuous vs. discrete, and categorical variables were classified as nominal vs. ordinal. Endpoints were classified as “descriptive statistics” if only descriptive statistics (and not inferential statistics) were presented, and as “inferential statistics” if a p-value was presented and a conclusion was drawn based on it. If the method used to estimate the p-value could not be identified, the endpoint was classified as “could not be determined”. Parametric methods include those tests that assume the data are normally distributed. Normality tests include the Kolmogorov-Smirnov, D’Agostino and Pearson omnibus, and Shapiro-Wilk tests. Equal variance tests, which are required for ANOVA (analysis of variance), include Levene’s test and Bartlett’s test.
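The classification rule described above can be sketched as a small function. This is only an illustration of the rule as stated in the text: the `Endpoint` fields, the `classify` helper, and the sets of test names are hypothetical and deliberately incomplete, and they do not reproduce the authors' actual coding sheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    reports_p_value: bool      # a p-value is presented and used for the conclusion
    test_name: Optional[str]   # inferential test named in the paper, None if unclear


# Illustrative, non-exhaustive name lists (assumptions for this sketch).
PARAMETRIC = {"t-test", "one-way anova", "two-way anova"}
NON_PARAMETRIC = {"mann-whitney u test", "kruskal-wallis test", "wilcoxon test"}


def classify(endpoint):
    """Label an endpoint following the rule stated in the Methods text."""
    if not endpoint.reports_p_value:
        return "descriptive statistics"
    name = (endpoint.test_name or "").strip().lower()
    if name in PARAMETRIC:
        return "inferential: parametric"
    if name in NON_PARAMETRIC:
        return "inferential: non-parametric"
    # p-value reported, but the method behind it cannot be identified
    return "inferential: could not be determined"


examples = [
    Endpoint(reports_p_value=False, test_name=None),
    Endpoint(reports_p_value=True, test_name="One-way ANOVA"),
    Endpoint(reports_p_value=True, test_name="Kruskal-Wallis test"),
    Endpoint(reports_p_value=True, test_name=None),
]
labels = [classify(e) for e in examples]
for label in labels:
    print(label)
```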

RESULTS

In the 30 selected papers, 113 outcome measures (or endpoints) were observed (Table 1). Most of them were numeric variables (112/113, 99%); the exception was one outcome that used a “pain scale” measure, which is an ordinal (categorical) variable. Of the 112 numeric variables, 109 were continuous, while 3 outcomes counted the number of neutrophils, immune-stained cells, or visits and could be categorized as discrete variables. The 109 continuous variables could be further classified into relative values (68/109, 63%, mostly relative to a comparison group) and absolute values (41/109, 37%), but few studies justified why the value was reported as relative or absolute.

Table 1.

Data characteristics of endpoints

Data type    Numerical                                   Categorical
Subtype      Continuous (absolute/relative)   Discrete   Ordinal
Endpoints    109 (41/68)                      3          1
Total        112                                         1

As shown in Fig. 1, the sample size distribution is heavily right-skewed, with most values clustered below 11 and the median, mean, and mode being 6, 16.4, and 3 & 6, respectively. A few outliers, with sample sizes of 71, 184, and 694, strongly influenced the mean. The paper that reported the large sample sizes of 184 and 694 concerned the neurotoxicity of heterocyclic amines, but its definition of sample number was unconventional (the number of neurites for the neurite length outcome, and mean pixel intensity in the regions of interest) (32). The paper with an average sample size of 71 concerned the neurotoxicity of flame retardants, and its sample number was the number of wells (of a 96-well plate) per group (22). Apart from these few outliers, most studies had small sample sizes, raising the concern that they may be underpowered for inferential statistics and that the observed dispersion might not reflect the true distribution of the outcome variables.
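The effect of a few large values on the mean, as opposed to the median and mode, can be reproduced with a toy example. The list below is hypothetical: it only mimics the shape described above (mostly small values plus three large ones) and is not the actual set of sample sizes from the reviewed papers.

```python
import statistics

# Hypothetical sample sizes: mostly small, with three large outliers.
sample_sizes = [3, 3, 3, 4, 5, 6, 6, 6, 8, 10, 71, 184, 694]

mean_all = statistics.mean(sample_sizes)
median_all = statistics.median(sample_sizes)
modes_all = statistics.multimode(sample_sizes)  # every value tied for most frequent

# Dropping the three large values shows how strongly they pull the mean.
without_outliers = [n for n in sample_sizes if n < 50]
mean_trimmed = statistics.mean(without_outliers)
median_trimmed = statistics.median(without_outliers)

print(f"all values:       mean={mean_all:.1f}, median={median_all}, modes={modes_all}")
print(f"without outliers: mean={mean_trimmed:.1f}, median={median_trimmed}")
```

In this toy data the median and the two modes barely react to the large values, whereas the mean changes by more than an order of magnitude, which is the reason the median is the more informative summary for a right-skewed distribution.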

Fig. 1.

Distribution of sample size per study. (A) Box plot showing the median (6), the 25th and 75th percentiles (box), the 5th and 95th percentiles (whiskers), and outliers (*). (B) Histogram showing the distribution of sample size per study.

To describe the central tendency of the data, the mean was most frequently used (105 outcomes, 93%), and 6 outcomes employed the median (Fig. 2A), but none explained the choice of summary statistic. Median survival time from survival analysis was presented in one study that measured the survival of mice (26), and levels of metabolites, mRNA, and protein after bisphenol A exposure were presented as the median with interquartile range in one paper (35). For the dispersion of the data, SEM/SE was most frequently used (66/113, 58%), followed by SD (39/113, 34%) (Fig. 2B). Interquartile ranges were provided for only 4 outcomes (35,37), all of which used the median and box-whisker plots for data description, and dispersion was not presented for 4 endpoints.

Fig. 2.

Distribution of measures used to describe summary statistics. (A) shows the distribution of the measures used to describe central tendency of the data (endpoints). (B) shows the distribution of measures used to describe dispersion of the data (endpoints). Legend: SD, standard deviation; SE, standard error; SEM, standard error of the mean.

Notably, the majority of outcome measures were analyzed by inferential statistics (93/113, 82%) (Fig. 3A), most probably because the studies aimed to examine differences between group means or the effects of toxicants. The endpoints for which inferential statistics were not conducted (20/113) mostly described observations or baseline characteristics, where determining group differences was not necessary. For inferential statistics, parametric methods were frequently employed (77/93, 82.8%), while non-parametric methods were rarely employed (3/93, 3.2%) (Fig. 3B). Of the 93 outcomes that underwent inferential statistics, 29 were two-group comparisons and the others (64/93, 68.8%) were multi-group comparisons (more than 2 groups). Interestingly, ANOVA was conducted for 55 of the 64 multi-group comparisons, making it the most favored inferential method for multi-group comparisons.

Fig. 3.

Description of studies conducting inferential statistics. (A) Proportion of endpoints conducting inferential statistics; endpoints were classified as “descriptive statistics” if only descriptive statistics (and not inferential statistics) were presented, and as “inferential statistics” if inferential statistics were conducted, namely, a p-value was presented and a conclusion was drawn based on it. (B) Classification of endpoints conducting inferential statistics. If the method used to estimate the p-value could not be identified, the endpoint was classified as “could not be determined”.

Fifteen of the ninety-three endpoints that underwent inferential statistics were subjected to an assumption test, which is critical for deciding whether to use parametric or non-parametric methods. Among parametric methods, one-way ANOVA was the most frequently employed (55/79, 69.6%), followed by the t-test (20/79, 27%, Fig. 4A). Interestingly, for one-way ANOVA, the homoscedasticity and normality assumptions were rarely tested (not tested in 51/52, 98.1%; Fig. 4B). Only one paper stated that a normality test was conducted before one-way ANOVA (36), while an equal variance test was conducted for none of the endpoints (0/49, 0%).

Fig. 4.

Detailed description of studies conducting ANOVA (analysis of variance). (A) Analysis of studies conducting parametric method. (B) Analysis of 52 studies conducting one-way ANOVA.

DISCUSSION

Our study illustrates that, in the 30 sampled articles published in Toxicological Sciences, Toxicology and Applied Pharmacology, and Archives of Toxicology, which are top-ranked toxicology journals, statistical methods were frequently used inconsistently when describing data and drawing inferences. Diverse methods were often mixed for the description of data; for example, both mean and median were used, but it is difficult to judge whether they were used properly since we could not identify the distribution of the variables in the papers. SD and SEM were also used without proper explanation. Descriptive statistics seek to describe a given study sample; thus, if the purpose is to describe the variability within the sample, SD should be used, and if the purpose is to estimate how the sample mean relates to the mean of the underlying population, SEM should be used. Since SEM is smaller than SD whenever the sample size exceeds one, one might speculate that SEM has frequently been used where SD should have been used, which understates the variability of the sample and may lead readers to assume smaller variation (3). Our study found that SEM is used more frequently than SD and that both are selected without clear justification. There is a high possibility that SEM is being used improperly, yet further study is needed to substantiate this. Our study also shows that many studies have small sample sizes, with the mode being 3 & 6 and the median being 6. It is therefore not clear whether the results of these studies could be reproduced with a larger sample, since data from small samples are especially vulnerable to outliers. Given that our study includes in vitro as well as in vivo studies, the median sample size could be even smaller if the analysis were limited to in vitro or in vivo studies only, which warrants future study.
To make an inference from test results, parametric or non-parametric methods can be used. Parametric methods, which include ANOVA and the t-test, are conventional and well known, yet they assume that the sample is normally distributed or approximately so. If the normality assumption is violated, a non-parametric method, sometimes called a distribution-free method, should be considered (6). Since the studies usually have small sample sizes (as shown in Fig. 1), the data might not conform to the normality assumption; yet our study shows that parametric approaches were frequently used without testing the relevant assumptions (Fig. 4B). One-way ANOVA, the most frequently used parametric method for comparing multiple groups, requires the homoscedasticity assumption. ANOVA, or analysis of variance, compares within-group variance with between-group variance, so the homoscedasticity assumption is critical, yet our study shows that this key assumption is frequently ignored. Statistical significance does not necessarily translate into clinical significance (38), implying that a statistical difference does not necessarily indicate a clinical (and possibly toxicological) difference. Thus, more attention should be paid when interpreting the statistical significance of study results. Also, whether to transform absolute values into relative values (such as change from baseline) or to use absolute values should be determined by the clinical (toxicological) interpretation of the results (5), yet our study shows that relative and absolute values were used together (as shown in Table 1) without proper explanation. Insufficient explanation of statistical methods was prevalent, which prevented us from conducting a more detailed analysis.
For example, the methods section of one study specifies that “the homogeneity of variance for all the data sets was firstly examined using Bartlett test, and then the data succeeding or failing to pass the test were analyzed by one-way analysis of variance or Kruskal–Wallis test”, respectively, but it does not specify whether ANOVA was used (the test passed) or not (the test failed) (15). Such endpoints were categorized as “inferential statistics conducted, yet parametric/non-parametric unclear (could not be determined)”, which might underestimate the overall performance of the studies. One should also interpret our results with caution: we sampled only 30 papers from the journals, so our results may not represent the general trend in toxicology journals. Nevertheless, our study illustrates that statistical methods in experimental toxicology studies are used inconsistently and with insufficient explanation of the methods selected for the analysis. The absence of relevant explanation of statistics is regrettable considering the prestige of these journals in toxicology. Considering that conclusions drawn from experimental studies provide critical evidence for regulatory affairs, a more consistent and transparent approach should be taken when analyzing study results. It would be important to review more papers and to invite biostatisticians to discuss the appropriateness of the statistical methods employed in toxicology. Such a study would provide a good reference for establishing an appropriate decision tree for the selection of statistical methods, which would be invaluable for improving the reliability of results and advancing toxicological research.
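The Discussion notes that ANOVA compares within-group variance with between-group variance. A minimal sketch of that comparison is given below, using the textbook one-way ANOVA F statistic (between-group mean square divided by within-group mean square). The function name `one_way_anova_f` and the three dose groups are hypothetical; converting F into a p-value requires the F distribution, which is outside the Python standard library and is not attempted here.

```python
import statistics


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups (lists of numbers)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [statistics.fmean(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical endpoint measured in three groups of n = 3 (illustration only).
control = [1.0, 2.0, 3.0]
low_dose = [2.0, 3.0, 4.0]
high_dose = [5.0, 6.0, 7.0]
groups = [control, low_dose, high_dose]

f_stat = one_way_anova_f(groups)
# Per-group sample variances; similar values are what homoscedasticity assumes.
group_variances = [statistics.variance(g) for g in groups]

print(f"F = {f_stat:.2f}")
print(f"group variances = {group_variances}")
```

Because the within-group mean square pools the variances of all groups into a single denominator, groups with very different variances distort F, which is why an equal variance check before ANOVA matters.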

REFERENCES (10 of 36 listed)

1. Investigation of the in vitro toxicological properties of the synthetic cannabimimetic drug CP-47,497-C8.

Authors: Verena J Koller; Volker Auwärter; Tamara Grummt; Bjoern Moosmann; Miroslav Mišík; Siegfried Knasmüller
Journal: Toxicol Appl Pharmacol Date: 2014-03-28 Impact factor: 4.219

2. Chrysin, an anti-inflammatory molecule, abrogates renal dysfunction in type 2 diabetic rats.

Authors: Amjid Ahad; Ajaz Ahmad Ganai; Mohd Mujeeb; Waseem Ahmad Siddiqui
Journal: Toxicol Appl Pharmacol Date: 2014-05-18 Impact factor: 4.219

3. Dioxin-induced retardation of development through a reduction in the expression of pituitary hormones and possible involvement of an aryl hydrocarbon receptor in this defect: a comparative study using two strains of mice with different sensitivities to dioxin.

Authors: Tomoki Takeda; Junki Taura; Yukiko Hattori; Yuji Ishii; Hideyuki Yamada
Journal: Toxicol Appl Pharmacol Date: 2014-05-02 Impact factor: 4.219

4. 2-Amino-1-methyl-6-phenylimidazo[4,5-b]pyridine (PhIP) is selectively toxic to primary dopaminergic neurons in vitro.

Authors: Amy M Griggs; Zeynep S Agim; Vartika R Mishra; Mitali A Tambe; Alison E Director-Myska; Kenneth W Turteltaub; George P McCabe; Jean-Christophe Rochet; Jason R Cannon
Journal: Toxicol Sci Date: 2014-04-09 Impact factor: 4.849

5. MDMA impairs mitochondrial neuronal trafficking in a Tau- and Mitofusin2/Drp1-dependent manner.

Authors: Daniel José Barbosa; Román Serrat; Serena Mirra; Martí Quevedo; Elena Gómez de Barreda; Jesús Avila; Eduarda Fernandes; Maria de Lourdes Bastos; João Paulo Capela; Félix Carvalho; Eduardo Soriano
Journal: Arch Toxicol Date: 2014-02-13 Impact factor: 5.153

6. Aberrant microRNA expression likely controls RAS oncogene activation during malignant transformation of human prostate epithelial and stem cells by arsenic.

Authors: Ntube N O Ngalame; Erik J Tokar; Rachel J Person; Yuanyuan Xu; Michael P Waalkes
Journal: Toxicol Sci Date: 2014-01-15 Impact factor: 4.849

7. Effects of N-acetyl-L-cysteine on target sites of hydroxylated fullerene-induced cytotoxicity in isolated rat hepatocytes.

Authors: Yoshio Nakagawa; Toshinari Suzuki; Kazuo Nakajima; Akiko Inomata; Akio Ogata; Dai Nakae
Journal: Arch Toxicol Date: 2013-07-23 Impact factor: 5.153

8. Comparison of life-stage-dependent internal dosimetry for bisphenol A, ethinyl estradiol, a reference estrogen, and endogenous estradiol to test an estrogenic mode of action in Sprague Dawley rats.

Authors: Mona I Churchwell; Luísa Camacho; Michelle M Vanlandingham; Nathan C Twaddle; Estatira Sepehr; K Barry Delclos; Jeffrey W Fisher; Daniel R Doerge
Journal: Toxicol Sci Date: 2014-02-04 Impact factor: 4.849

9. Bleomycin-induced epithelial-mesenchymal transition in sclerotic skin of mice: possible role of oxidative stress in the pathogenesis.

Authors: Cheng-Fan Zhou; Deng-Chuan Zhou; Jia-Xiang Zhang; Feng Wang; Wan-Sheng Cha; Chang-Hao Wu; Qi-Xing Zhu
Journal: Toxicol Appl Pharmacol Date: 2014-04-12 Impact factor: 4.219

10. Aspirin-triggered resolvin D1 down-regulates inflammatory responses and protects against endotoxin-induced acute kidney injury.

Authors: Jiao Chen; Sreerama Shetty; Ping Zhang; Rong Gao; Yuxin Hu; Shuxia Wang; Zhenyu Li; Jian Fu
Journal: Toxicol Appl Pharmacol Date: 2014-04-04 Impact factor: 4.460
