Literature DB >> 29319501

Unit of analysis issues in laboratory-based research.

Nick R Parsons¹, M Dawn Teare², Alice J Sitch³.

Abstract

Many studies in the biomedical research literature report analyses that fail to recognise important data dependencies from multilevel or complex experimental designs. Statistical inferences resulting from such analyses are unlikely to be valid and are often potentially highly misleading. Failure to recognise this as a problem is often referred to in the statistical literature as a unit of analysis (UoA) issue. Here, by analysing two example datasets in a simulation study, we demonstrate the impact of UoA issues on study efficiency and estimation bias, and highlight where errors in analysis can occur. We also provide code (written in R) as a resource to help researchers undertake their own statistical analyses.

Entities: Chemical Disease Gene Species

Keywords: Science Forum; epidemiology; experimental design; global health; mixed-effects models; statistics

Mesh：

Year: 2018 PMID： 29319501 PMCID： PMC5762161 DOI： 10.7554/eLife.32486

Source DB: PubMed Journal: Elife ISSN： 2050-084X Impact factor: 8.140

Introduction

Defining the experimental unit is a key step in the design of any experiment. The experimental unit is the smallest object or material that can be randomly and independently assigned to a particular treatment or intervention in an experiment (Mead et al., 2012). The experimental unit (e.g. a tissue sample, individual animal or study participant) is the object a scientist wants to make inferences about in the wider population, based on a sample in the experiment. In the simplest possible experimental setting where each experimental unit provides a single outcome or observation, and only in this setting, the experimental unit is the same as both the unit of observation (i.e the unit described by the observed outcomes) and the unit of analysis (UoA) (i.e. that which is analysed). In general this will not always be the case, so care must be taken, both when planning and reporting research, to clearly define the experimental unit, and what data are being analysed and how these relate to the aims of the study. In laboratory based research in the biomedical sciences it is almost always the case that multiple observations or measurements are made for each experimental unit. These multiple observations, which could be simple replicate measurements from a single sample or observations from multiple sub-samples taken from a single sample, allow the variability of the measure and the stability of the experimental setting to be assessed. They improve the overall statistical power of a research study. However, multiple or repeat observations taken from the same experimental unit tend to be more similar than observations taken from different experimental units, irrespective of the treatments applied or when no treatments are applied. Therefore data within experimental units are likely to be dependent (correlated), whereas data from different experimental units are generally assumed to be independent, all other things being equal (i.e after removing the direct and indirect effects of the experimental interventions and setting). The majority of widely reported statistical methods (e.g. t-tests, analyses of variance, generalized linear models, chi-squared tests) assume independence between all observations in an analysis, possibly after conditioning on other observed data variables. If the UoA is the same as the experimental unit (i.e. a single observation or summary measure is available for each unit) then the independence assumption is likely to be met. However, many studies reported in the biomedical research literature using multilevel design, often also referred to as mixed-effects, nested or hierarchical designs (Gelman and Hill, 2007), or more complex structured designs, fail to recognise the fact that independence assumptions are unlikely to be valid, and thus the reported analyses are also unlikely to be valid. Statistical inferences made from such analyses are often highly misleading. UoA issues, as they are termed in the statistical literature (Altman and Bland, 1997), are not limited to biomedical laboratory studies, and are recognised as a major cause of concern more generally for reported analyses in bioscience and medicine (Aarts et al., 2014; Altman and Bland, 1997; Bunce et al., 2014; Fleming et al., 2013; Lazic, 2010; Calhoun et al., 2008; Divine et al., 1992), and also feed into widely acknowledged issues around the lack of reproducibility and repeatability of much biomedical research (Academy of Medical Sciences, 2017; Bustin and Nolan, 2016; Ioannidis et al., 2014; McNutt, 2014). The RIPOSTE (Reducing IrreProducibility in labOratory STudiEs) framework was established to support the dialogue between scientists and statisticians in order to improve the design, conduct and analysis of laboratory studies in biomedical sciences in order to reduce irreproducibility (Masca et al., 2015). The aim of this manuscript, which evolved directly from a number of recommendations made by the RIPOSTE framework, is to help laboratory scientists identify potential UoA issues, to understand the problems an incorrect analysis may cause and to provide practical guidance on how to undertake a valid analysis using the open source R statistical software (R Core Team, 2016; Ihaka and Gentleman, 1996). A simple introduction to the basics of R is available from Venables et al., 2017 and sources of information on implementation of statistical methods in the biosciences are widely available (see, for example, Aho, 2014). A simulation study is undertaken in order to quantify losses in efficiency and inflation of the false positive rate that an incorrect analysis may cause (Appendix 1). The principles of experimental design are briefly discussed, with some general guidance on implemtation and good practice (Appendix 2), and two example datasets are introduced as a means to highlight a number of key issues that are widely misunderstood within the biomedical science literature. Code in the R programming language is provided both as a template for those wishing to undertake similar analyses and in order that all results here can be replicated (Appendix 3); script is available at Parsons, 2017. In addition, a formal mathematical presentation of the most common analysis error in this setting is also provided (Appendix 4).

Methods and materials

Background

A fundamental aspect of the design of all experimental studies is a clear identification of the experimental unit. By definition, this is the smallest object or material that can be randomly and independently assigned to a particular treatment or intervention in the experiment (Mead et al., 2012). The experimental unit is usually the unit of statistical analysis and should provide information on the study outcomes independent of the other experimental units. Where here the term outcome refers to a quantity or characteristic measured or observed for an individual unit in an experiment; most experiments will have many outcomes (e.g. expression of multiple genes, or mutiple assays) for each unit. The term multiple outcomes refers to such situtations, but is not the same as repeated outcomes (or more often repeated measures) which refers to measuring the same outcome at multiple time-points. Experimental designs are generally improved by increasing the number of (independent) experimental units, rather than increasing the number of observations within the unit beyond what is require to measure within unit variation with reasonable precision. If only a single observation of a laboratory test is obtained for each subject, data can be analysed using conventional statistical methods provided all the usual cautions and necessary assumptions are met. However, if there are for instance multiple observations of a laboratory test observed for each subject (e.g. due to multiple testing, duplicated analyses of samples or other laboratory processes) then the analysis must properly take account of this. If all observations are treated equally in an analysis, ignoring the dependency in the data that arises from multiple observations from each sample, this leads to inflation of the false positive (type I error) rate and incorrect (often highly inflated) estimates of statistical power, resulting in invalid statistical inference (see Appendix 1). Errors due to incorrect identification of the experimental unit were identified as an issue of concern in clinical medicine more than 20 years ago, and continue to be so (Altman and Bland, 1997). The majority of such UoA issues involve multiple counting of measurements from individual subjects (experimental units); these issues have particular traction in for instance orthopaedics, ophthalmics and dentistry, where they typically result from measurements on right and left hips, knees or eyes of a study participant or a series of measurements on many teeth from the same person. The drive to improve standards of reporting and thereby design and analysis of randomized clinical trials, which resulted in the widely known CONSORT guidelines (CONSORT GROUP (Consolidated Standards of Reporting Trials) et al., 2001), has now expanded to cover many related areas of biomedical research activity. For instance, work by (Kilkenny et al., 2009) highlighted poor standards of reporting of experiments using animals, and made specific mention of the poor reporting of the number of experimental units; this work led directly to the ARRIVE guidelines (Animal Research: Reporting of In Vivo Experiments; Kilkenny et al., 2010) that explicitly require authors to report the study experimental unit when describing the design. The recent Academy of Medical Sciences symposium on the reproducibility and reliability of biomedical research (Academy of Medical Sciences, 2017) specifically highlighted poor experimental design and inappropriate analysis as key problem areas, and highlighted the need for additional resources such as the NC3Rs (National Centre for the Replacement, Reduction and Refinement of Animals in Research) free online experimental design assistant (NC3Rs, 2017).

Design

The experimental unit should always be identified and taken into account when designing a research study. If a study is assessing the effect of an intervention delivered to groups rather than individuals then the design must address the issue of clustering; this is common in many health studies where a number of subjects may receive an intervention in a group setting or in animal experiments where a group of animals in a controlled environment may be regarded as a cluster. This is also the case if a study is designed to take repeated measurements from individual subjects or units, from a source sample or replicate analyses of a sample itself. Individuals in a study may also be subject to inherent clustering (e.g. family membership) which needs to be identified and accounted for. As a prelude to discussion of analysis issues, it is important to distinguish between a number of widely reported and distinct types of data resulting from a variety of experimental designs. The word subject is used here loosely to mean the subject under study in an experiment and need not necessarily be an individual person, participant or animal. Individual subjects: In many studies the UoA will naturally be an individual subject, and be synonymous with the experimental unit. A single measurement is available for each subject, and inferences from studies comprising groups of subjects apply to the wider population to which the individual subject belongs. For example, a blood sample is collected from patients (experimental units) and a haemoglobin assay is undertaken for each sample. Statistical analysis compares haemoglobin levels between groups of patients, where the variability between samples is used to assess the significance of differences in means between groups of patients. Groups of subjects: Measurements are available for subjects. However, rather than being an individual subject, the experimental unit could be a group of subjects that are exposed to a treatment or intervention. In this case, inferences from analyses of variation between experimental units, apply to the groups, but not necessarily to individual subjects within the groups. For example, suppose actively growing maize plants are planted together at high density in groups of size in controlled growing environments (growth rooms) of varying size and conditions (e.g. light and temperature). Chlorophyll fluorescence is used to measure stress for individual plants after two weeks of growth. Due to the expected strong competition between plants, inferences about the effects of the environmental interventions on growth are made at the room level only. Alternatively, in a different experiment the same plants are divided between growth rooms, kept spatially separated in notionally exactly equivalent conditions, after being previously given one of two different high strength foliar fertiliser treatments. Changes in plant height (from baseline) are used to assess the effect of the foliar interventions on individual plants. Although the intention was to keep growth rooms as similar as possible, inevitably room-effects meant that outcomes for individual plants tended to be more similar if they came from the same room, than if they came from different rooms. In this setting the plant is the experimental unit, but account needs to be made for the room-effects in the analysis. Multiple measurements from a single source sample: In laboratory studies, the experimental unit is often a sample from a subject or animal, which is perhaps treated and multiple measurements taken. Statistical inferences from analyses of data from such samples should apply to the individual tissue (source) from which the sample was taken, as this is the experimental unit. For example, consider the haemoglobin example (i), if the assay is repeated times for each of the blood samples, then there would be data values available for analysis. The analysis should take account of the fact that the replicate measurements made for each sample tell us nothing useful about the variability between samples, which are the experimental units. Multiple sub-samples from a single sample: Often a single sample from an experimental unit is sub-divided and results of assays or tests of these sub-samples yield data that provide an assessment of the variability between sub-samples. It is important to note that this is not the same as taking multiple samples from an experimental unit. The variability between experimental units is not the same as, and must be distinguished from, variability within an experimental unit and this must be reflected in the analysis of data from such studies. For example, samples of cancerous tissue (experimental unit) are each divided into sub-samples and lymph node assays made for each. The variability between the sub-samples, for each of the experimental units, is not necessarily the same as the variability that might have been evident if more than one tissue sample had been taken from each experimental unit. This could be due to real differences as the multiple samples are from different sources, or batch-effects due to how the samples are processed or treated before testing. Repeated measures: One of the most important types of experimental design is the so-called repeated-measures design, in which measurements are taken on the same experimental unit at a number of time-points (e.g. on the same animal or tissue sample after treatment, at more than one occasion). These multiple measurements in time are generally assumed to be correlated and regarded as repeat measurements from an experimental unit and not separate experimental units. The likely autocorrelation between temporally related measurements from the experimental units should be reflected in the analysis of such studies. For example, height measurements for the plants in (ii) could have been made at each of occasions. The height measurements are a useful means of assessing temporal changes for individual plants (experimental unit), such as the rate of increase (e.g. per day). However, due to the likely strong correlations, increasing the number of assessment occasions will generally add much less information to the analysis than would be obtained by increasing the number of experimental units. Clearly many of these distinct design types can be combined to create more complex settings; e.g. plants might be housed together in batches that cause responses from the plants in the same batch to be correlated (batch-effects), and samples taken from the plants, divided into sub-samples, and processed at two different testing centres, possibly resulting in additional centre-effects. For such complex designs, it is advisable to seek expert statistical advice, however the focus in the sections discussing analysis is mainly on cases (ii), (iii) and (iv). Case (i) is handled adequately by conventional statistical analysis, and although case (v) is important, it is too large a topic to discuss in great depth here (see e.g. (Diggle et al., 2013) for a wide ranging discussion of longitudinal data analysis). More general design issues are discussed in Appendix 2.

Sample size

Power analysis provides a formal statistical assessment of sample size requirements for many common experimental designs; power here is the probability (usually expressed as a percentage) that the chosen test correctly rejects the study null hypothesis, and is usually set at either 80% or 90%. Many simple analytic expressions exist for calculating sample sizes for common types of design, particular for clinical settings where methods are well developed and widely used (Chow et al., 2008). Power increases as the square root of the sample size , so power is gained by increasing but at a diminishing rate with . Also power is inversely related to the variance of the outcome , so choosing a better or more stable outcome or assay or test procedure will increase power. For the most simple design with a normally distributed outcome, comparing two groups of subjects (e.g. as in Design case (i)), the sample size is given by , where is the difference we wish to detect, represents the the upper standard normal centile, and is the power and the significance level; for the standard significance of 5% and power of 90%, . Where there are clusters of subjects (e.g. as in Design case (ii)), then the correlation between observations within clusters will have an impact on the sample size (Hemming et al., 2011). The conventional sample size expression needs to be inflated by a variance inflation factor (VIF), also called a design effect, given by , where there are observations in each cluster (e.g. a batch) and ICC is the intraclass (within cluster) correlation coefficient that quantifies the strength of association between subjects within a cluster. The ICC can either be estimated from pilot data or from previous studies in the same area (see examples), or otherwise a value must be assumed. For small cluster sizes () and intraclass correlations (), the sample size needs only to be inflated by typically less than 10% (see Table 1). However for larger values of both and ICC, sample sizes may need to be doubled, trebled or more to achieve the required power.

Table 1.

Variance inflation factors for cluster sizes () 2, 5, 10 and 20, and intraclass correlation coefficients (ICC) 0.01, 0.05, 0.1 and 0.5.

m	ICC
	0.01	0.05	0.1	0.5
2	1.01	1.05	1.10	1.50
5	1.04	1.20	1.40	3.00
10	1.09	1.45	1.90	5.50
20	1.19	1.95	2.90	10.50

For more complex settings, often the only realistic option for sample size estimation is simulation. Raw data values are created from an assumed distribution (e.g. multivariate normal distribution with known means and covariances) using a random number generator, and the planned analysis performed on these data. This process can be repeated many (usually thousands of) times and the design characteristics (e.g. power and type I error rate) calculated for various sample sizes. This has typically been a task that requires expert statistical input, but increasingly code is available in R to make this much easier (Green and MacLeod, 2016; Johnson et al., 2015). Many application area dependent rules of thumb exist when selecting a sample size, the most general being the resource equation approach of (Mead et al., 2012), which suggests that approximately 15 degrees of freedom are required to estimate the error variance at each level of an analysis.

Analysis

Incorrect analysis of data that have known or expected dependencies leads to inflation of the false positive rate (type I error rate) and invalid estimates of statistical power, leading to incorrect statistical inference; a simulation study (Appendix 1) shows how various design characteristics can affect the properties of a hypothetical study. Focussing on linear statistical modelling (McCullagh and Nelder, 1998), which is by far the most widely used methodology for analysis when reporting research in the biomedical sciences, there are generally two distinct approaches to analysis when there are known UoA issues (Altman and Bland, 1997).

Subject-based analysis

The simplest approach to analysis is to use a single observation for each subject. This could be achieved by selecting a single representative observation or more usually by calculating a summary measure for each subject. The summary measure is often the mean value, but could be for instance the area under a response curve or the gradient (rate) measure from a linear model. Given that this results in a single observation for each subject, analysis can proceed using the summary measure data in the conventional way using a generalized linear model (GLM; (McCullagh and Nelder, 1998)) assuming independence between all observations. A GLM relates a (link function) transformed response variable to a linear combination of explanatory variables via a number of model parameters that are estimated from the observed data. The explanatory variables are so-called fixed-effects that represent the (systematic) observed data that are used to model the response variable. The lack of model fit is called the residual or error, and represents unstructured deviations from the model predictions that are beyond control. The subject-based approach is valid but has the disadvantage that not all of the available data are used in the definitive analysis, resulting in some lack of efficiency. Care must be taken when choosing a single measure for each subject, ensuring the selection does not introduce bias and if a summary measure is generated, this value must be meaningful and if appropriate the analysis should be weighted to account for the precision in estimation of the summary measure.

Mixed-effect analysis

A better approach than the subject-based analysis, is a mixed-effect analysis (Galwey, 2014; Pinheiro and Bates, 2000). A (generalized) linear mixed effects model (GLME) is an extension of the conventional GLM, where structure is added to the error term, leaving the systematic fixed terms unchanged, by adding so-called random-effect terms that partition the error term into a set of structured (often nested) terms. In the simplest possible setting (Bouwmeester et al., 2013), the error term is replaced by a subject-error term to model the variation between subjects and a within-subject error term to model the within subject variation. This partition of the error into multiple strata allows, for instance, the correct variability (subject-error term) to be used to compare groups of subjects. Random-effects are often thought of as terms that are not of direct inferential interest (in contrast to the fixed-effects) but are such that they need to be properly accounted for in the model; e.g. a random selection of subjects or centres in a clinical trial, shelves in an incubator that form a temperature gradient or repeat assays from a tissue sample. The algorithms used to estimate the model terms for a GLME and details of how to model complex error structures will not be discussed further, but more details can be found in for instance Pinheiro and Bates, 2000. Mixed-effects models can be fitted in most statistical software packages, but the focus here is on the R open source statistical software (R Core Team, 2016). Detailed examples of implementation and code are provided in Appendix 3 and a script is available at Parsons, 2017 to reproduce all the analysis shown here using the R packages nlme (Pinheiro et al., 2016) and lme4 (Bates et al., 2015).

Results

In order to better appreciate the importance of UoA issues, to understand how these issues arise and to show statistically how analyses should be implemented, two example datasets from real experiments are described and analysed in some detail. The aims of the experiments are clearly not of direct importance, but the logic, process and conduct of the analyses are intended to be sufficiently general in nature so as to elucidate many key problematic issues.

Example 1: Adjuvant radiotherapy and lymph node size in colorectal cancer

Six subjects diagnosed with colorectal cancer, after confirmatory magnetic resonance imaging, underwent neoadjuvant therapy comprising of a short course of radiotherapy (RT) over one week prior to resection surgery. These subjects were compared with six additional cancer subjects, of similar age and disease severity, who did not receive the adjuvant therapy. The aim of the study was to assess whether the therapy reduced lymph node size in the resection specimen (i.e. the sample removed during surgery). The resection specimen for each subject was divided into two sub-samples after collection, and each was fixed in formalin for 48-72 hr. These sub-samples were processed and analysed at two occasions, by different members of the laboratory team. The samples were sliced at 5mm intervals and images captured and analysed in an automated process that identified lymph node material which was measured by a specialist pathologist to give a measure of individual lymph node size (i.e. diameter), based on assumed sphericity. Three slices per sub-sample were collected for each subject. Table 2 shows the measured lymph node sizes in mm for each sample.

Table 2.

Lymph node sizes (mm), by sample slice and subject, by radiotherapy (RT) group, subjects 1 to 6 no RT and subjects 7 to 12 short RT; highlighted cells are those removed to unbalance the design.

None					Short RT
Subject	Sample	Slice			Subject	Sample	Slice
		1	2	3			1	2	3
1	1	1.71	1.98	1.88	7	1	2.37	2.36	2.20
	2	1.72	1.98	1.85		2	2.36	2.62	2.60
2	1	2.51	2.55	2.65	8	1	1.33	1.35	1.15
	2	2.98	3.20	2.80		2	1.90	1.87	1.85
3	1	1.69	1.72	1.80	9	1	1.70	1.78	1.78
	2	1.82	1.97	1.73		2	2.07	1.76	1.85
4	1	1.72	1.78	2.04	10	1	2.23	2.14	2.21
	2	2.50	2.65	2.77		2	2.50	2.33	2.16
5	1	3.32	3.27	3.07	11	1	2.10	1.89	1.75
	2	3.11	3.03	3.11		2	2.11	2.16	2.12
6	1	2.33	2.48	2.53	12	1	2.58	2.54	2.59
	2	2.86	2.87	2.52		2	2.77	2.65	2.60

Naive analysis

The simplest analysis and the one that may appear to be correct if no information on the design or data structure shown in Table 2 were known, would be a t-test that compares the mean lymph node size between the RT groups. This shows that there is reasonable evidence to support a statistically significant difference in mean lymph node size between those subjects who received RT (Short RT) and those who did not (None); mean in group None = 2.403 mm and in group RT Short = 2.120 mm, difference in means = 0.283 mm (95% CI; 0.057 to 0.508), with a t-statistic = 2.501 on 70 degrees of freedom, and a p-value = 0.015. The conclusion from this analysis is that lymph node sizes were statistically significantly smaller in the group that had received adjuvant RT. Why should the veracity of this result be questioned? The assumptions made when undertaking any statistical analysis must be considered carefully. The t-statistic is calculated as the absolute value of the difference between the group means, divided by the pooled standard error of the difference (sed) between the group means. This latter quantity is given by , where and are the sample sizes in the two groups and is the pooled variance given by ; where and are the variances within each group. The important thing to realize here is that the variances within each of the RT groups are calculated by simply taking the totality of data for all six subjects in each group, across all sample types and slices. One of the key assumptions of the t-test is that of independence. Specifically, this requires the lymph node sizes to be all independent of each other; i.e. the observed size for one particular node is not systematically related to the other lymph node size data used for the statistical test. What is meant by related to in this context? It seems highly likely that the lymph node sizes for repeat slices for any particular sample for a subject are more similar than size measurements from other subjects. Similarly, it might be expected that lymph node sizes for the two samples for each subject are more similar than lymph nodes size measurements from other subjects. If the possibility that this is important is ignored, and a t-test is undertaken, then the variability measured between samples and between slices within samples is being used to assess differences between subjects. If the assumption of independence is not valid, then by ignoring this, claims for statistical significance may be being made that are not supported by the data (See Appendix 4 for a mathematical description of the naive analysis). Given that the lymph node size measurements within samples and subjects are likely to be more similar to each other than to data from other subjects, how should the analysis be conducted? Visual inspection of the data can often reveal patterns that are not apparent from tabular summaries; Figure 1 shows a strip plot of the data from Table 2.

Figure 1.

A strip plot showing observed lymph node size data by subject (1-12) and sample, after none and a short course of radiotherapy (Short RT).

It is clear, from a visual inspection alone of Figure 1, that data from repeat slices within samples are more similar (clustered together) than data from the repeat samples within each subject. And also that data from the multiple samples and slices for each subject are generally clustered together; data from a single subject are usually very different from other subjects, irrespective of the RT grouping. One, albeit crude, solution to such issues is to calculate a summary measure for each of the experimental units at the level at which the analysis is made, and use these measures for further analysis. The motivation for doing this is that it is usually reasonable to assume that experimental units (subjects) are independent of one another, so if a t-test is undertaken on summary measures from each of the twelve subjects it is also reasonable to assume that the necessary assumption of independence is true. Using the mean lymph node size for each subject as the summary measure (subjects 1 to 12; 1.85, 2.78, 1.79, 2.24, 3.15, 2.60, 2.42, 1.57, 1.82, 2.26, 2.02, and 2.62 mm), a t-test shows that there is no evidence to support a statistically significant difference in mean lymph node size between those subjects who received RT (Short RT) and those who did not (None); mean in group None = 2.403 mm and in group RT Short = 2.120 mm, difference in means = 0.283 mm (95% CI; -0.321 to 0.886), with a t-statistic = 1.043 on 10 degrees of freedom, and a p-value = 0.322. Note that the group means are the same but now the t-statistic is based on 10 degrees of freedom, rather than the 70 of the naive analysis, and the confidence interval is considerably wider than that estimated for the naive analysis. The conclusion from this analysis is that there is no evidence to support a difference in lymph node size between groups. Why is the result of this t-test so different from the previous naive analysis? In the naive analysis the variability between measurements within the main experimental units (subjects) and the variability between experimental units was used to assess the difference between experimental units. In the analysis in this section the variability between experimental units alone has been used to assess the effect of the intervention applied to the experimental units. The multiple measurements within each experimental unit improve the precision of the estimate of the unit mean, but provide no information on the variability between units, that is important in assessing interventions that are applied to the experimental units. This analysis is clearly an improvement on the naive analysis, but it uses only summary measures for each experimental unit, rather than the full data, it tells us nothing about the relative importance of the variability between subjects, between samples and between slices and it does not allow us to assess the importance of these design factors to the conclusions of the analysis.

Linear mixed-effects analysis

To correctly explain and model the lymph node data a linear mixed-effects model must be used. The experimental design used in the lymph node study provides the information needed to construct the random-effects for the mixed-effects model. Here there are multiple levels within the design that are naturally nested within each other; samples are nested within subjects, and slices are nested within samples. Fitting such a mixed-effects model gives the following estimate for the intervention effect (RT treatment groups); difference in means = 0.283 mm (95% CI; -0.321 to 0.886), with a p-value = 0.322 (t-statistic = 1.043 on 10 degrees of freedom). For a balanced design, intervention effect estimates for the mixed-effects model are equivalent to those from the subject-based analysis. A balanced design is one where there are equal numbers of observations for all possible combinations of design factor levels; in this example there are the same number of slices within samples and samples within subjects. The mixed effects model allows the variability within the data to be examined explicitly. Output from model fitting also provides estimates of the standard deviations of the random effects for each level of the design; these are for subjects, = 0.436 (95% CI; 0.262 to 0.727), samples = 0.236 (95% CI; 0.151 to 0.362) and residuals (slices) = 0.122 (95% CI; 0.100 to 0.149). Squaring to get variances, indicates that the variability, in lymph node size, between subjects was three and half times more than the variability between samples, and nearly thirteen times as much as the variability between repeat slices within samples. The intraclass correlation coefficient measures the strength of association between units within the same group; for subjects = 0.733, where . This large value, which represents the correlation between two randomly selected observations on the same subject, shows why the independence assumption required for the naive analysis is wrong (i.e. independence implies that ICC = 0). This demonstrates clearly why pooling variability without careful thought about the sampling strategy and design of an experiment is unwise, and likely to lead to erroneous conclusions. Various competing models for random effects can be compared using likelihood ratio tests (LRT). For instance in this example suppose that the two samples collected for the same subject had been arbitrarily labelled as sample 1 and sample 2, and in practice there was no real difference in the methods used to process or capture images of nodes from the two samples. In such a setting, a more appropriate random effects model may be to have a subject effect only and ignore the effects of samples within subjects. Constructing such a model and comparing to the more complex model gives a LRT = 39.92 and p-value < 0.001, providing strong support in favour of the full multilevel model. Diagnostic analyses can be undertaken after fitting mixed-effects model, in an analogous manner to linear models (Fox et al., 2011). Figure 2 shows boxplots of residuals for each subject and a quantile-quantile plot to assess Normality of the residuals. Inspection of the residual plots for the lymph node size data, show that assumptions of approximate Normality are reasonable; e.g. the quantile-quantile plot of the residuals from the model fit fall (approximately) along a straight line when plotted against theoretical residuals from a Normal distribution. If residuals fail to be so well behaved and deviate in a number of well understood ways, or if for instance variances are non-equal or vary with the outcome (heterogeneity), then transforming the data prior to linear mixed-effects analysis can improve the situation (Mangiafico, 2017). However, in general, if the Normality assumption is not sustainable, data are better analysed using generalized linear mixed effects models (Pinheiro and Bates, 2000; Galwey, 2014), that better account for the distributional properties of the data.

Figure 2.

Boxplots of residuals (observed values - fitted values) for each subject; symbols () are medians, boxes are interquartile ranges (IQR), whiskers extend to 1.5IQR and symbols () outside these are suspected outliers (a).

Quantile-quantile (Q–Q) plot of the model residuals () on the horizontal axis against theoretical residuals from a Normal distribution on the vertical axis (b).

Boxplots of residuals (observed values - fitted values) for each subject; symbols () are medians, boxes are interquartile ranges (IQR), whiskers extend to 1.5IQR and symbols () outside these are suspected outliers (a).

Quantile-quantile (Q–Q) plot of the model residuals () on the horizontal axis against theoretical residuals from a Normal distribution on the vertical axis (b).

Unbalanced data analysis

Intervention effect estimates for the mixed-effects and subject-based analyses presented here are equivalent, due to the balanced nature of the design. Every subject has complete data for all samples and slices. By calculating means for each subject averaging occurs across the same mix of samples and slices, so irrespective of the effects on the analysis of these factors, the means will be directly comparable and estimated with equivalent precision. Whilst balance is a desirable property of any experimental design, it is often unrealistic and impractical to obtain data structured in this way; for instance in this example, samples may be contaminated or damaged during processing or insufficient material may be available for all three slices. Repeating the above mixed-effects analysis after randomly removing 50% of the data (see Table 2), gives an estimated difference in lymph node size between groups = 0.263 mm (95% CI; -0.397 to 0.922), with a p-value = 0.391, and estimates of the standard deviations of the random effects for each level of the design, (95% CI; 0.224 to 0.794), (95% CI; 0.160 to 0.489) and (95% CI; 0.088 to 0.174). These are, perhaps surprisingly given that only half the data from the previous analysis are being used, very similar to estimates from the complete data. However, in the unbalanced setting the subject-based analysis is no longer valid, as it ignores the variation in sample sizes between subjects; the estimated difference in lymph node size between groups is 0.199 mm (95% CI; -0.474 to 0.872) for the subject-based analysis.

Example 2: Lymph node counts after random sampling

The most extreme example of non-normal data is for binary responses, which generally results from yes/no or present/absence type outcomes. Extending the lymph node example, in a parallel study, rather than measure the sizes of selected nodes or conduct a time-consuming count of all nodes, a random sampling strategy was used to select regions of interest (RoI) in which fives nodes were randomly selected and compared to a 2mm reference standard (2mm; yes or no). This could be done rapidly by a non-specialist. Five samples were processed for each of twelve subjects, in an equivalent design to the lymph node size study; data are shown in Table 3.

Table 3.

Number of five selected lymph nodes with maximum diameters 2mm, for up to five tissue samples per subject (1-12), after either none or a short course of radiotherapy (Short RT).

None						Short RT
Subject	Sample					Subject	Sample
	1	2	3	4	5		1	2	3	4	5
1	4	4	-	-	-	7	1	0	0	0	0
2	3	4	5	2	-	8	1	2	-	-	-
3	2	3	3	2	-	9	1	0	1	0	2
4	2	4	1	2	1	10	2	1	4	0	2
5	3	4	4	3	5	11	4	2	4	3	3
6	2	5	5	3	3	12	3	4	3	-	-

Non-normal data analysis

For some subjects there was insufficient tissue for five samples, resulting in an unbalanced design. The odds of an event (i.e. observing or not observing a lymph node with diameter 2mm), is the ratio of the probabilities of the two possible states of the binary event, and the odds ratio is the ratio of the odds in the two groups of subjects (e.g. those receiving either None or Short RT). A naive analysis of these data suggest an estimate of the odds ratio of (43/82)/(79/46) = 0.31, for RT Short versus None groups; 43 lymph nodes with maximum diameters 2mm from 125 in the RT Short group versus 79 from 125 in the None group. Being in the RT Short group results in a lower odds of lymph nodes with diameters 2mm. This is the result one would obtain by conventional logistic regression analysis; odds-ratio 0.31 (95% CI; 0.18 to 0.51; p-value < 0.001) providing very strong evidence that lymph node diameters were lower in the RT Short group. In logistic regression analysis the estimated regression coefficients are interpreted as log odds-ratios, which can be transformed to odds ratios using the exponential function (Hosmer et al., 2013). However, one should be instinctively cautious about this result, as it is clear from Table 3 that variation within subjects is much less than between subjects; i.e. some subjects have low counts across all samples and others have high counts across all samples. The above analysis ignores this fact and pools variation between samples and between subjects to test for differences between two groups of subjects. This is clearly not a good idea. Fitting a GLME model with a subject random effect, gives an estimated odds-ratio for the Short RT group of 0.26 (95% CI; 0.09 to 0.78; p-value = 0.016). The predicted probability of detecting a lymph node with a diameter 2mm was 0.65 for the None RT group and 0.33 for the Short RT. The overall conclusions of the study have not changed, however the level of significance associated with the result is massively overstated in the simple logistic regression, due to the much smaller estimate of the standard error of the log odds-ratio (0.264 for logistic regression versus 0.564 for the mixed-effects logistic regression). By failing to properly account for the difference in variability between measurements made on the same subject relative to the variability in measurements between subjects results in overoptimistic conclusions.

Discussion

The examples, simulations and code provided highlight the importance of correctly identifying the UoA in a study, and show the impact on the study inferences of selecting an inappropriate analysis. The simulation study (Appendix 1) shows that the false positive rate can be extremely high and efficiency very low if analyses are undertaken that do not respect well known statistical principles. The examples reported are typical of studies in the biomedical sciences and together with the code provide a resource for scientists who may wish to undertake such analyses (Appendix 3). Although clearly discussion with a statistician, at the earliest possible stage in a study, should always be strongly encouraged, in practice this may not be possible if statisticians are not an integral part of the research team. The RIPOSTE framework (Masca et al., 2015) called for the prospective registration (Altman, 2014) and publication of study protocols for laboratory studies, which we believe if implemented would go a long way towards addressing many of the issues discussed here by causing increased scrutiny at all stages of an experimental study. The examples, design and analysis methods presented here have deliberately used terminology such as experimental unit, subject and sample to make the arguments more comprehensible, particularly for non-statisticians, who often find these topics conceptually much easier to understand using such language. This may have contributed to the widespread belief amongst many laboratory scientists that these issues are important only in human experimentation. Where, for instance, the subject is a participant in a clinical trial and the idea that subjects provide data that are independent of one another, but correlated within a subject seems perfectly natural. However, although such language is used here, it is important to emphasise that the issues discussed apply to all experimental studies and are arguably likely to be more not less important for laboratory studies than for human studies. The lack of appreciation of the importance of UoA issues in laboratory science may be due to the misconception that the within subject associations observed for human subjects arise mainly from the subjective nature of the measures used in clinical trials on human subjects; e.g. patient-reported outcomes. Contrasting these with the more objective (hard) measures that dominate in much biomedical laboratory based science leads many to assume that that these issues are not important when analysing data and reporting studies in their own research area. Mixed-effects models are now routinely used in the medical and social sciences (where they are often known as multilevel models), to for instance allow for the clustering in patient data from a recruiting centre in a clinical trial, or to model the association in outcomes within schools and classrooms from students (Brown and Prescott, 2015; Snijders and Bosker, 2012). Mixed-effects models originated from the work of pioneering statistician/geneticist R. A. Fisher (Fisher, 1919), whose classic texts on experimental design have led to their extensive and very early use in agricultural field experimentation (Mead et al., 2012). However, the use of mixed-effects models in the biological sciences has not spread from the field to the laboratory. Mixed-effects models are not used as widely in biomedical laboratory studies as in many other scientific disciplines, which is a concern, as given the nature of the experimental work reported one would expect these models to be equally widely used and reported as they are elsewhere. This is most likley simply a matter of lack of knowledge and convention; if colleagues or peers do not routinely use these methods then why should I? By highlighting the issue and providing some guidance the hope is that this article may address the first of these issues. Journals and other interest groups (e.g. funding bodies and learned societies) have a part to play also, particularly in ensuring that work is reviewed by experienced and properly qualified statisticians at all stages from application to publication (Masca et al., 2015). In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included. Thank you for submitting your article "Unit of analysis issues continue to be a cause of concern in reporting of laboratory-based research" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Mark Jit as the Reviewing Editor and Peter Rodgers as the eLife Features Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Jenny Barrett (Reviewer #2); Chris Jones (Reviewer #3). The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. The reviewers and editors were in agreement on the value of the concept and approach of the manuscript. There were a large number of issues that we felt needed to be addressed, but we do not believe that any of them will take a long time to complete. Summary: The tutorial describes issues related to non-independence in data from laboratory and other experiments and further show how they may be overcome, both in a simple way (using subject-level averages) and a more comprehensive way (using mixed models). This is a common problem, and the paper does a good job of both explaining it and giving researchers the tools to deal with it. Its utility is greatly enhanced by very clear detailed illustrative examples and R code to carry out the analyses discussed. Title: The current title indicates that the paper is going to show that "Unit of analysis issues continue to be a cause of concern in reporting of laboratory-based research", but that is not what the paper does. Rather, the paper provides guidelines on how to understand the concept of "Unit of analysis" and analyse experiments appropriately. The title should be changed to reflect this. Essential revisions: Currently the article contains no guidance on sample size calculation for either the "simple" analyses or the more complex analyses. Nor does it contain any guidance on minimal sample size for the modelling methods suggested. Some comments on sample size and power would be valuable as these are issues that are often neglected by lab scientists. It would also be useful for anyone considering more complex analyses to have an idea of the minimum sample size that can realistically be used to fit the models. Subsection “Design”. Different designs. Please include some examples of experiments for each situation, as this would make it easier for lab scientists to recognise their type of sample in this list. The example of groups of subjects seems to refer to situations where interest is in the group itself. A common situation instead is where interest is on the effect of treatment on an individual (the experimental unit), but the individuals happen to be grouped (correlated), and it could be useful to clarify this distinction. For example, in laboratory studies the samples may have been analysed in different batches. Appendix 2 in its current form may not be very helpful or informative to the majority of readers. It does not really explain how to choose among alternative designs, and the equations are likely to be forbidding to non-statisticians. While there are no space limitations in eLife, it should be rewritten to focus on the design issues: when should you get more measurements per subject, vs. more subjects? What good are such within-subject replicates (e.g. small improvements in precision, but particularly to be able to measure assay error). It would also benefit from a box summarising what it is showing in a couple of simple sentences, so people that can't get through the equations can at least understand the point it is making. The code in Appendix 3 is very helpful, but it is difficult to read in its present form. We recommend publishing it in text form using indentation, colours, and explanatory text interspersed with the sections of code to explain it. Ideally, it should be written as a tutorial (with portions of text and code interspersed). It would also be good to show how the data for each of the examples is structured within a database – i.e. with variables representing the individual, clustering, groupings etc. Lab scientists are generally less familiar with how data is entered/stored in databases/stats software, and they may be familiar with GraphPad Prism, which accepts data in very different formats to the standard format required for the analyses presented in this paper. Appendix 3 could be expanded to include the data frames next to the R code (at the start of each example). Title: The current title indicates that the paper is going to show that "Unit of analysis issues continue to be a cause of concern in reporting of laboratory-based research", but that is not what the paper does. Rather, the paper provides guidelines on how to understand the concept of "Unit of analysis" and analyse experiments appropriately. The title should be changed to reflect this. Title changed to “Unit of analysis issues in laboratory based research: a review of concepts and guidance on study design and reporting”. Essential revisions: Currently the article contains no guidance on sample size calculation for either the "simple" analyses or the more complex analyses. Nor does it contain any guidance on minimal sample size for the modelling methods suggested. Some comments on sample size and power would be valuable as these are issues that are often neglected by lab scientists. It would also be useful for anyone considering more complex analyses to have an idea of the minimum sample size that can realistically be used to fit the models. A new subsection has been added, after the ‘Analysis’ subsection, that discusses sample size estimation from initially a very simple design, to more complex GLMMs via simulation. Subsection “Design”. Different designs. Please include some examples of experiments for each situation, as this would make it easier for lab scientists to recognise their type of sample in this list. The example of groups of subjects seems to refer to situations where interest is in the group itself. A common situation instead is where interest is on the effect of treatment on an individual (the experimental unit), but the individuals happen to be grouped (correlated), and it could be useful to clarify this distinction. For example, in laboratory studies the samples may have been analysed in different batches. Simple examples have been added to the design types in the subsection “Design”. The ‘Groups of subjects’ example has been expanded to cover the kind of ‘batch-effects’ identified by the reviewer. Appendix 2 in its current form may not be very helpful or informative to the majority of readers. It does not really explain how to choose among alternative designs, and the equations are likely to be forbidding to non-statisticians. While there are no space limitations in eLife, it should be rewritten to focus on the design issues: when should you get more measurements per subject, vs. more subjects? What good are such within-subject replicates (e.g. small improvements in precision, but particularly to be able to measure assay error). It would also benefit from a box summarising what it is showing in a couple of simple sentences, so people that can't get through the equations can at least understand the point it is making. Appendix 2 has been modified to discuss fundamental design issues for a putative example experiment. It now focuses more on design issues, and uses less mathematical language that should be more accessible to readers of eLife. The mathematical details of the (incorrect) naïve analysis has been moved to a separate new appendix (Appendix 4). The code in Appendix 3 is very helpful, but it is difficult to read in its present form. We recommend publishing it in text form using indentation, colours, and explanatory text interspersed with the sections of code to explain it. Ideally, it should be written as a tutorial (with portions of text and code interspersed). Appendix 3 (R code for examples) has been completely revised and re-written along the lines suggested here. It is now written in the style of a tutorial with code indented and coloured to distinguish it from the main text. R output is also now provided to help those wishing to check exactly what would be produced if the code were pasted directly into R. It would also be good to show how the data for each of the examples is structured within a database – i.e. with variables representing the individual, clustering, groupings etc. Lab scientists are generally less familiar with how data is entered/stored in databases/stats software, and they may be familiar with GraphPad Prism, which accepts data in very different formats to the standard format required for the analyses presented in this paper. Appendix 3 could be expanded to include the data frames next to the R code (at the start of each example). We agree that the data entry in the previous example R code was not realistic. Appendix 3 now explicitly shows the format of the data in R. A note is also added to explain how data would normally be entered using the read statement that will import data into R from standard spreadsheets or databases.

18 in total

Review 1. Statistics notes. Units of analysis.

Authors: D G Altman; J M Bland
Journal: BMJ Date: 1997-06-28

Review 2. Addressing the unit of analysis in medical care studies: a systematic review.

Authors: Aaron W Calhoun; Gordon H Guyatt; Michael D Cabana; Downing Lu; David A Turner; Stacey Valentine; Adrienne G Randolph
Journal: Med Care Date: 2008-06 Impact factor: 2.983

3. The unit of analysis error in studies about physicians' patient care behavior.

Authors: G W Divine; J T Brown; L M Frazier
Journal: J Gen Intern Med Date: 1992 Nov-Dec Impact factor: 5.128

4. The time has come to register diagnostic and prognostic research.

Authors: Douglas G Altman
Journal: Clin Chem Date: 2014-02-11 Impact factor: 8.327

5. Increasing value and reducing waste in research design, conduct, and analysis.

Authors: John P A Ioannidis; Sander Greenland; Mark A Hlatky; Muin J Khoury; Malcolm R Macleod; David Moher; Kenneth F Schulz; Robert Tibshirani
Journal: Lancet Date: 2014-01-08 Impact factor: 79.321

6. Prediction models for clustered data: comparison of a random intercept and standard regression model.

Authors: Walter Bouwmeester; Jos W R Twisk; Teus H Kappen; Wilton A van Klei; Karel G M Moons; Yvonne Vergouwe
Journal: BMC Med Res Methodol Date: 2013-02-15 Impact factor: 4.615

7. RIPOSTE: a framework for improving the design and analysis of laboratory-based research.

Authors: Nicholas Gd Masca; Elizabeth Ma Hensor; Victoria R Cornelius; Francesca M Buffa; Helen M Marriott; James M Eales; Michael P Messenger; Amy E Anderson; Chris Boot; Catey Bunce; Robert D Goldin; Jessica Harris; Rod F Hinchliffe; Hiba Junaid; Shaun Kingston; Carmen Martin-Ruiz; Christopher P Nelson; Janet Peacock; Paul T Seed; Bethany Shinkins; Karl J Staples; Jamie Toombs; Adam Ka Wright; M Dawn Teare
Journal: Elife Date: 2015-05-07 Impact factor: 8.140

8. Ophthalmic statistics note 1: unit of analysis.

Authors: Catey Bunce; Krishna V Patel; Wen Xing; Nick Freemantle; Caroline J Doré
Journal: Br J Ophthalmol Date: 2013-12-19 Impact factor: 4.638

9. Power analysis for generalized linear mixed models in ecology and evolution.

Authors: Paul C D Johnson; Sarah J E Barry; Heather M Ferguson; Pie Müller
Journal: Methods Ecol Evol Date: 2014-12-06 Impact factor: 7.781

10. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis?

Authors: Stanley E Lazic
Journal: BMC Neurosci Date: 2010-01-14 Impact factor: 3.288

5 in total

1. Categorical representation from sound and sight in the ventral occipito-temporal cortex of sighted and blind.

Authors: Stefania Mattioni; Mohamed Rezk; Ceren Battal; Roberto Bottini; Karen E Cuculiza Mendoza; Nikolaas N Oosterhof; Olivier Collignon
Journal: Elife Date: 2020-02-28 Impact factor: 8.140

2. Ten common statistical mistakes to watch out for when writing or reviewing a manuscript.

Authors: Tamar R Makin; Jean-Jacques Orban de Xivry
Journal: Elife Date: 2019-10-09 Impact factor: 8.140

3. In vitro longitudinal lumbar spinal cord preparations to study sensory and recurrent motor microcircuits of juvenile mice.

Authors: Mustafa Görkem Özyurt; Julia Ojeda-Alonso; Marco Beato; Filipe Nascimento
Journal: J Neurophysiol Date: 2022-08-10 Impact factor: 2.974

4. Intra-articular injection of triamcinolone acetonide releasing biomaterial microspheres inhibits pain and inflammation in an acute arthritis model.

Authors: Imke Rudnik-Jansen; Karin Schrijver; Nina Woike; Anna Tellegen; Sabine Versteeg; Pieter Emans; George Mihov; Jens Thies; Niels Eijkelkamp; Marianna Tryfonidou; Laura Creemers
Journal: Drug Deliv Date: 2019-12 Impact factor: 6.419

5. Between two stools: preclinical research, reproducibility, and statistical design of experiments.

Authors: Penny S Reynolds
Journal: BMC Res Notes Date: 2022-02-21

5 in total