
The effect of simple imputation on inferences about population means when data are missing in biomedical research due to detection limits.

Hongyue Wang¹, Guanqing Chen¹, Xiang Lu¹, Hui Zhang², Changyong Feng³.

Abstract

The sample geometric mean has been widely used in biomedical and psychosocial research to estimate and compare population geometric means. However, due to the detection limit of measurement instruments, the actual value of the measurement is not always observable. A common practice to deal with this problem is to replace missing values by small positive constants and make inferences based on the imputed data. However, no work has been carried out to study the effect of this naïve imputation method on inference. In this report, we show that this simple imputation method may dramatically change the reported outcomes of a study and, thus, make the results uninterpretable, even if the detection limit is very small.


Keywords: population geometric mean; sample geometric mean; two-sample test

Year: 2015 PMID: 26977131 PMCID: PMC4764008 DOI: 10.11919/j.issn.1002-0829.215121

Source DB: PubMed Journal: Shanghai Arch Psychiatry ISSN: 1002-0829

Introduction

The detection limit is a long-standing problem in experimental sciences. It refers to the limited ability of an instrument to measure an outcome of interest in a certain range (typically small values close to 0). Many instruments cannot return meaningful measurements if signals fall below a certain threshold value. This problem is especially prevalent in biomedical sciences, as signals are sometimes too weak to be detected in the presence of ambient noise. Although detection limits are often due to limitations of physical devices, the problem may also arise in psychosocial research with assessments based on instruments (questionnaires). For example, in alcohol and substance use research, alcohol or drug use may not be detected in a subject if the blood level is not sufficiently high. Also, if answers for all or most subjects to an item in a questionnaire fall below (or above) a certain score in the potential range of scores, the lack of variability in the outcome may prevent any useful analysis of the data.

A detection limit presents problems for statistical analysis since no data (or very little data) are observed in part of the potential range of the variable. It is not possible to gauge the variability of the outcome below the detection limit, but this information is needed to conduct standard statistical inference on the data in the whole range (for example, to estimate the geometric mean of the population from which the sample is drawn). A commonly used ad hoc method to deal with this problem is to impute data below the detection limit and then apply standard statistical methods.[1] This practice is especially prevalent in biomedical research. Geometric means are the statistic most often reported from data imputed in this way, because data are often log-transformed to reduce skewness before being analyzed (even though the log-transformation may not actually reduce the skewness[2]), and the arithmetic mean of the log-transformed outcome is the logarithm of the sample geometric mean.

Although imputation seems natural and intuitive, it has significant implications for statistical inference and, thus, for the reported results of research.[3, 4] In this report we discuss the pitfalls of using this common method of imputation in research and in clinical practice.

Geometric mean of a non-negative random variable

The so-called ‘geometric mean’ used in biomedical and psychosocial research is actually the sample geometric mean, that is, the geometric mean of a sample of observations from an underlying distribution. For example, let $a_i$, i=1, ..., n, be a sequence of non-negative numbers; then the geometric mean of the sequence is $\left(\prod_{i=1}^{n} a_i\right)^{1/n}$. This is what is commonly referred to as the ‘geometric mean’, but it is important to keep in mind that the sample and population geometric means are completely different concepts. The former is computable from the sample, while the latter is an unknown quantity, or a parameter in the nomenclature of statistics. The geometric mean described above is a sample geometric mean precisely because it is a computable quantity.

Unlike the arithmetic mean, the population geometric mean had never been clearly defined in the literature until the recent work of Feng and colleagues,[2, 3] who presented a formal definition of this elusive quantity. The population geometric mean of a non-negative variable X is defined if either X has non-zero probability at zero (that is, X may equal 0) or X is positive with |log X| having a finite mean value, that is, E|log X|<∞. Subsequently the definition was further broadened to only require that E|log X| exists (which includes E|log X|<∞ as a special case) and the properties of the population geometric mean were elaborated.[4] This work lays a conceptual foundation to interpret the sample geometric mean (as an estimate of the underlying population geometric mean) and clarifies some ambiguities in using the geometric mean in biomedical research. A brief summary of this work is shown in Box 1.

The geometric mean has a very unusual property. We know that for a positive random variable, the arithmetic mean, if it exists, is always positive. However, for some positive random variables, the population geometric mean can be zero. This fact is counterintuitive, as the sample geometric mean obtained from a positive random variable is always positive.
This unusual property can have significant implications for inference about the population geometric mean when the data in the sample are left-censored due to a detection limit.

Another issue is the relationship between the geometric mean and the arithmetic mean for a positive random variable. In biomedical research, data are often right-skewed with most values close to the lower limit. A popular approach for dealing with this is to log-transform the data, analyze the transformed data, and then transform the result back to the original scale. For a non-negative random variable X, in general neither the geometric mean $GM_X$ nor the arithmetic mean E(X) determines the other, even if they both exist (Jensen's inequality gives only $GM_X \le E(X)$). For example, for two log-normally distributed random variables, if they have the same log-mean values but different log-variances, then their geometric means are equal but their arithmetic means are not equal. This means that we cannot test the hypothesis that they have the same (arithmetic) mean values by testing that the log-transformed data have the same (arithmetic) mean values. This fact is not well appreciated in biomedical research.[2]

BOX 1. Relationship of population and sample geometric means

• Suppose F(x) is the probability distribution function of a non-negative random variable X. If F(0) > 0, then the geometric mean of X, denoted by $GM_X$, is 0. If F(0) = 0 and E(log X) exists, then $GM_X = \exp\{E(\log X)\}$.
• Suppose the geometric mean of X exists, and $X_i$, i=1, ..., n, is a random sample from the distribution of X. Let $GM_n = \left(\prod_{i=1}^{n} X_i\right)^{1/n}$ be the sample geometric mean. Then the sample geometric mean is strongly consistent[3, 4] and, thus, is a consistent estimate of the population geometric mean.
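The log-normal example above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the two parameter pairs (log-sd 0.5 and 1.5) are illustrative assumptions, not values from the paper. It relies on two standard facts: the geometric mean of a log-normal variable with log-mean μ is exp(μ), and its arithmetic mean is exp(μ + σ²/2).

```python
import math


def sample_geometric_mean(values):
    """Sample geometric mean: exp of the arithmetic mean of the logs.

    Returns 0.0 if any observation is exactly 0, and rejects negative input.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one observation")
    if any(v < 0 for v in values):
        raise ValueError("geometric mean requires non-negative values")
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def lognormal_geometric_mean(mu, sigma):
    """Population geometric mean of a log-normal variable: exp(E[log X])."""
    return math.exp(mu)


def lognormal_arithmetic_mean(mu, sigma):
    """Population arithmetic mean of a log-normal variable."""
    return math.exp(mu + sigma ** 2 / 2)


# Two hypothetical populations: same log-mean (0), different log-sd.
gm_a = lognormal_geometric_mean(0.0, 0.5)
gm_b = lognormal_geometric_mean(0.0, 1.5)
am_a = lognormal_arithmetic_mean(0.0, 0.5)
am_b = lognormal_arithmetic_mean(0.0, 1.5)

print("geometric means: ", gm_a, gm_b)    # equal
print("arithmetic means:", am_a, am_b)    # different
```

Because the geometric means coincide while the arithmetic means do not, a test of equal means on the log scale says nothing about equality of the arithmetic means on the original scale.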

Geometric mean with detection limit

In this section, we discuss the effect of the naïve imputation method on the geometric mean in the presence of a detection limit. Let X be a positive random variable and δ be the lower detection limit; X is unobservable (missing) if X < δ. A common approach in biomedical research is to define a modified version of X by:

$$X^* = \begin{cases} X, & X \ge \delta \\ \eta, & X < \delta \end{cases}$$

where η is some positive constant. Usually, η=δ/2, or a small positive constant less than δ.[1, 5-15] After the imputation, inference about the population geometric mean of the original data proceeds by treating the imputed data as if they were observed.

To discuss potential effects of this naïve imputation on inference about the population geometric mean, we assume that X is a positive random variable, which is the case in most real-study applications.

a) $GM_X$>0. The geometric mean of X* can be greater than, less than, or equal to $GM_X$ depending on the distribution of X below the detection limit. If the detection limit δ is small enough, then with relatively large sample sizes inferences based on the imputed data (such as confidence intervals, the two-sample t-test, and the paired t-test) yield valid results for the original data. However, if δ is large, imputation may yield invalid results.

b) $GM_X$=0. With the imputation, the geometric mean of X* depends on how the imputed value η is selected and is always greater than η. This means that the estimated geometric mean based on the imputed data may be very far away from the theoretical geometric mean of zero. Another effect is that the imputation brings some arbitrariness into the statistical inference.

Thus whether $GM_X$>0 or $GM_X$=0, imputation has significant implications for inference about the population geometric mean. If $GM_X$>0, inference using common statistical methods is reasonably robust if the detection limit is small; but if $GM_X$=0, any analysis of the geometric mean based on the imputed data is invalid and the result is uninterpretable. Unfortunately, the detection limit makes it impossible to determine whether $GM_X$>0 or $GM_X$=0.
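The arbitrariness introduced by the choice of η can be seen on a toy data set. The sketch below is illustrative only: the readings, the detection limit of 0.5, and the helper names are hypothetical, and `None` is used as an assumed convention for a reading reported only as "below the detection limit".

```python
import math


def impute_non_detects(readings, delta, eta=None):
    """Replace non-detects (None) with the constant eta (default delta / 2)."""
    if eta is None:
        eta = delta / 2
    if not 0 < eta <= delta:
        raise ValueError("eta must lie in (0, delta]")
    return [eta if r is None else r for r in readings]


def geometric_mean(values):
    """Sample geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical readings; None marks a value below the detection limit.
readings = [0.8, None, 1.9, None, 3.2]
delta = 0.5

gm_half = geometric_mean(impute_non_detects(readings, delta))                   # eta = delta / 2
gm_tenth = geometric_mean(impute_non_detects(readings, delta, eta=delta / 10))  # eta = delta / 10

print(f"eta = delta/2  -> GM = {gm_half:.4f}")
print(f"eta = delta/10 -> GM = {gm_tenth:.4f}")
```

The observed values are identical in both runs, so the entire difference between the two geometric means comes from the analyst's choice of η. Each result also stays above its η, in line with case b) above.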

Simulation results

As described above, when the population geometric mean of a positive random variable is 0, the geometric mean of the modified observations (which impute values below the detection limit δ) may be very different from 0 and, thus, inferences based on the modified sample may be misleading. Suppose Y has a standard log-normal distribution with probability distribution function Φ, and U is independent of Y and uniformly distributed on (0, 1). Let C0 be a positive constant. The random variable X is defined from Y, U, and C0 by equation (1), which also determines the distribution function of X [both formulas are missing from the extracted text]. In the simulation study, C0 is set at 0.277602. The data X1, ..., Xn are generated from the distribution of X defined in equation (1).

Properties of geometric mean

Figure 1 shows the cumulative distribution functions of X and Y when C0=0.1. Since Φ(0.1)=0.01, it is nearly impossible to distinguish between the two distribution function curves in the figure. However, their geometric means are very different. It is easy to prove that $GM_Y$=1 and $GM_X$=0, no matter how small C0 is (see Example 2 in Feng and colleagues[4] for a proof).
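The two cut points quoted in the text can be checked against the standard log-normal distribution function. This is a small sketch using `statistics.NormalDist` from the Python standard library; the helper name `lognormal_cdf` is an assumption made for illustration.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # log Y ~ N(0, 1) when Y is standard log-normal


def lognormal_cdf(x):
    """P(Y <= x) for a standard log-normal Y."""
    if x <= 0:
        return 0.0
    return std_normal.cdf(math.log(x))


p_below_01 = lognormal_cdf(0.1)                    # cut point used for Figure 1
p_below_c0 = lognormal_cdf(0.277602)               # cut point used in the simulation
c0_from_quantile = math.exp(std_normal.inv_cdf(0.10))
gm_y = math.exp(0.0)                               # GM of Y is exp(E[log Y]) = exp(0)

print(f"P(Y < 0.1)      = {p_below_01:.4f}")
print(f"P(Y < 0.277602) = {p_below_c0:.4f}")
print(f"10% quantile    = {c0_from_quantile:.6f}")
print(f"GM of Y         = {gm_y}")
```

The value C0=0.277602 is, to rounding, the 10th percentile of the standard log-normal distribution, and Φ(0.1) is roughly 0.01, as stated.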

Figure 1. Cumulative distribution functions of X and Y in formula (1) with c0=0.1

Since X is positive, the sample geometric mean is always positive. Although the sample geometric mean is a consistent estimator of $GM_X$ (which is 0 in this case), it may be quite a large number in any given sample. Table 1 is a sample of n=100 observations from the distribution of Z=100X, where X is defined in equation (1). The sample geometric mean is 85.72. However, the population geometric mean is actually $GM_Z$=0. It is very difficult to imagine that the data in Table 1 are from a distribution with a geometric mean of 0. This strange property of the geometric mean makes it difficult to test whether or not a sequence of positive numbers is a sample from a distribution with a geometric mean equal to 0.

Table 1: A random sample from a distribution with geometric mean 0 (sample size n=100)

166.95	70.67	75.68	2.61	264.39	55.30	129.93	87.55	172.43	59.95
211.11	127.63	71.91	362.70	73.12	293.65	292.67	369.40	139.59	304.00
155.42	16.80	109.80	18.34	190.47	29.37	53.43	62.93	137.79	44.72
152.19	84.66	172.02	45.94	437.89	110.13	53.51	152.44	75.92	60.48
151.47	513.60	34.72	69.70	492.94	42.03	4.48	82.01	445.03	35.22
2.67	41.08	205.55	73.19	713.21	182.35	43.62	67.32	37.21	65.01
108.44	747.98	15.69	59.55	122.46	475.55	0.95	261.28	96.82	168.29
44.53	191.05	74.81	143.88	194.59	26.63	90.69	141.91	25.92	251.09
55.08	154.57	53.82	66.33	53.58	17.57	115.23	6.69	49.44	303.29
118.96	48.13	39.11	690.46	170.17	217.58	62.74	79.84	26.43	106.79

In the simulation study, we set the detection limit at δ=0.277602 such that there was a 10% probability that X is below the detection limit, that is, Pr{X<0.277602}=0.1. No data are observed below this detection limit, so if the value of δ/2 is imputed for all cases in which X falls below δ=0.277602, then the modified observations are

$$X_i^* = \begin{cases} X_i, & X_i \ge \delta \\ \delta/2, & X_i < \delta \end{cases} \qquad i = 1, \ldots, n. \qquad (2)$$

Let $\bar{X}_n$ and $\bar{X}_n^*$ be the sample means, and let $GM_n$ and $GM_n^*$ be the sample geometric means, of (X1, ..., Xn) and (X1*, ..., Xn*) respectively. Table 2 shows the means and standard deviations of $\bar{X}_n$, $\bar{X}_n^*$, $GM_n$, and $GM_n^*$ for samples of different sizes after 100,000 Monte Carlo replications. In each replicate a random sample X1, ..., Xn is generated, and the four statistics are calculated. For each n, the mean and standard deviation of $GM_n$ reported in the table are the sample mean and sample standard deviation over the 100,000 Monte Carlo replicates.

Table 2: Means and standard deviations of sample means and sample geometric means

sample size n	$\bar{X}_n$ mean	$\bar{X}_n$ sd	$\bar{X}_n^*$ mean	$\bar{X}_n^*$ sd	$GM_n$ mean	$GM_n$ sd	$GM_n^*$ mean	$GM_n^*$ sd
10	1.6459	0.6936	1.6485	0.6930	0.9169	0.4460	1.0326	0.3445
50	1.6406	0.3068	1.6433	0.3065	0.7573	0.3079	0.9887	0.1450
100	1.6415	0.2167	1.6442	0.2165	0.7027	0.2797	0.9834	0.1017
500	1.6418	0.0969	1.6445	0.0968	0.5944	0.2389	0.9795	0.0453
1,000	1.6413	0.0688	1.6440	0.0688	0.5399	0.2464	0.9789	0.0321
5,000	1.6411	0.0306	1.6438	0.0306	0.3083	0.3056	0.9783	0.0143
10,000	1.6413	0.0217	1.6440	0.0216	0.1571	0.2651	0.9783	0.0101

The same interpretation applies to the other columns in the table. There are two main findings shown in the table:

a) Since the detection limit is relatively small, the difference between the means $\bar{X}_n$ and $\bar{X}_n^*$ is very small. They are very close to E(X1) and E(X1*) even for small sample sizes.

b) The geometric means behave very differently. $GM_n$ converges to 0, while $GM_n^*$ converges to a constant far away from 0. The sample geometric means $GM_n$ and $GM_n^*$ also change substantially as the sample size increases.

The panels in Figure 2 show the histograms of $GM_n$ (the ‘A’ series) and $GM_n^*$ (the ‘B’ series) after 100,000 Monte Carlo replications. Although the distribution of $GM_n^*$ is skewed for relatively small sample sizes (n=10), the skewness almost disappears for relatively large sample sizes. However, the distribution of $GM_n$ is skewed for all sample sizes, particularly for large sample sizes. Since $GM_X$=0, most of the sample geometric means $GM_n$ clustered around 0 when n=10,000.
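The qualitative pattern in Table 2 can be reproduced with a small Monte Carlo sketch. An important caveat: the construction of X below C0 used here, log X = log C0 + 1 − 1/U, is an assumption chosen only because it makes E(log X) diverge to minus infinity (so that the population geometric mean is 0); it is not taken from the paper's equation (1). The replication count is also far smaller than the paper's 100,000, so the numbers will not match Table 2. All draws are kept on the log scale because the assumed X can be too small to represent as a floating-point number.

```python
import math
import random

C0 = 0.277602            # cut point and detection limit quoted in the text
LOG_C0 = math.log(C0)


def draw_log_x(rng):
    """Log of one draw of X under the ASSUMED stand-in construction.

    log X = log Y                if Y >= C0   (Y standard log-normal)
    log X = log C0 + 1 - 1/U     if Y <  C0   (U uniform on (0, 1])
    E[1/U] is infinite, so E[log X] = -infinity and GM_X = 0.
    """
    log_y = rng.gauss(0.0, 1.0)
    if log_y >= LOG_C0:
        return log_y
    u = 1.0 - rng.random()       # lies in (0, 1], so 1/u is always defined
    return LOG_C0 + 1.0 - 1.0 / u


def summarize(log_x, delta=C0):
    """Arithmetic and geometric means before and after imputing delta/2."""
    n = len(log_x)
    log_delta = math.log(delta)
    log_eta = math.log(delta / 2)
    log_x_star = [lx if lx >= log_delta else log_eta for lx in log_x]
    return {
        "mean": sum(math.exp(lx) for lx in log_x) / n,
        "mean_star": sum(math.exp(lx) for lx in log_x_star) / n,
        "gm": math.exp(sum(log_x) / n),
        "gm_star": math.exp(sum(log_x_star) / n),
    }


def monte_carlo(n, reps, seed=0):
    """Average each summary statistic over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    totals = {"mean": 0.0, "mean_star": 0.0, "gm": 0.0, "gm_star": 0.0}
    for _ in range(reps):
        stats = summarize([draw_log_x(rng) for _ in range(n)])
        for key in totals:
            totals[key] += stats[key]
    return {key: total / reps for key, total in totals.items()}


result = monte_carlo(n=1000, reps=200)
for name, value in result.items():
    print(f"{name:>9}: {value:.4f}")
```

Under this assumed construction the two arithmetic means can differ by at most δ/2, whereas the geometric mean of the imputed data is bounded below by δ/2 regardless of how small the unobserved values are.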

Figure 2. Histogram of sample geometric means from the distributions of X (part A) and X* (part B) in formula (2) for different sample sizes

Hypothesis testing using geometric means

Let X11, ..., X1,n1 and X21, ..., X2,n2 be two independent samples. Suppose we want to test the hypothesis H0: $GM_1 = GM_2$, where $GM_k$ denotes the population geometric mean of the distribution of the k-th sample (k=1, 2). Due to the detection limit, only the modified data Xk1*, ..., Xk,nk* can be used. The test statistic used in biomedical research, T*, is of the form given in formula (3) [the formula is missing from the extracted text], where Sk*² is the sample variance of log Xk1*, ..., log Xk,nk* (k=1, 2).

In the simulation studies, X11 has the same distribution as X defined in equation (1) with C0=0.277602, and X21 is defined as 2X11. Figure 3 shows the histograms of p-values of the test statistic T* for different sample sizes. In our example both populations have the same geometric mean (namely 0), so the null hypothesis is true and the distribution of the p-values of the test statistic T* should be close to the uniform distribution, at least for large sample sizes. However, the histograms shown in Figure 3 clearly indicate otherwise. Thus results of testing the null hypothesis when using the modified data are difficult to interpret and can be quite misleading.
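A test of this kind can be sketched as follows. Because formula (3) is not available here, the statistic below is an assumption: an unpooled (Welch-type) two-sample statistic on the log-transformed imputed data, with a two-sided p-value from the standard normal approximation, which is only reasonable for large samples. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def impute(values, delta):
    """Replace every value below the detection limit delta with delta / 2."""
    return [v if v >= delta else delta / 2 for v in values]


def log_scale_two_sample_test(x1, x2, delta):
    """ASSUMED form of T*: difference of mean logs over an unpooled standard error.

    Returns (t_star, p_value). The p-value uses a normal approximation.
    Raises ZeroDivisionError if both log-samples have zero variance.
    """
    logs1 = [math.log(v) for v in impute(x1, delta)]
    logs2 = [math.log(v) for v in impute(x2, delta)]
    se = math.sqrt(variance(logs1) / len(logs1) + variance(logs2) / len(logs2))
    t_star = (mean(logs1) - mean(logs2)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(t_star)))
    return t_star, p_value


# Hypothetical data: the second sample is the first one doubled.
sample_1 = [0.2, 1.0, 2.0, 4.0, 8.0]
sample_2 = [2 * v for v in sample_1]
t_star, p_value = log_scale_two_sample_test(sample_1, sample_2, delta=0.5)
print(f"T* = {t_star:.3f}, p = {p_value:.3f}")
```

Note that the imputation is applied before the logs are taken, so the value 0.2 in the first sample becomes 0.25 while its doubled counterpart 0.4 in the second sample also becomes 0.25. The two imputed samples are therefore no longer exact multiples of each other.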

Figure 3. Histograms of p-values of the test statistic in formula (3) for different sample sizes

Discussion

In this paper we consider the effect of the most common method of data imputation used in biomedical research for results that are below a detection limit. Despite its popularity, this method of using imputed values to compute a sample geometric mean, which is then used to estimate the population geometric mean (needed in many common statistical analyses), has not been adequately reviewed in the statistical literature. We use simulation studies to show that the sample geometric mean is a very unstable statistic, so even small modifications introduced by this common imputation method can have a major effect on the estimate of the true (population) geometric mean and, thus, on statistical inference.[4] The sample geometric mean based on data that include imputed values can be quite different from the true geometric mean, so the conclusions of hypothesis testing based on the use of modified data can be misleading.

All these problems stem from a very special property of the geometric mean: a positive random variable may have a geometric mean of 0. However, given a random sample from the distribution of a positive random variable, there is no method to determine whether or not the population geometric mean is 0, a problem that is compounded by the detection limit, which requires the use of imputed values when computing the sample geometric mean. Any computed estimate of the geometric mean under the detection limit is uninterpretable.

Another issue with the detection limit is measurement error. In this paper we assume that there is no measurement error from the device or instrument. The effect of potential measurement error on the detection limit requires further investigation.

11 in total

1. Log transformation: application and interpretation in biomedical research.

Authors: Changyong Feng; Hongyue Wang; Naiji Lu; Xin M Tu
Journal: Stat Med Date: 2012-07-16 Impact factor: 2.373

2. Serial measurements of C-reactive protein and interleukin-6 in the immediate postnatal period: reference intervals and analysis of maternal and perinatal confounders.

Authors: C Chiesa; F Signore; M Assumma; E Buffone; P Tramontozzi; J F Osborn; L Pacifico
Journal: Clin Chem Date: 2001-06 Impact factor: 8.327

3. Antibody responses to influenza viruses in paediatric patients and their contacts at the onset of the 2009 pandemic in Mexico.

Authors: Guadalupe Miranda-Novales; Lourdes Arriaga-Pizano; Cristina Herrera-Castillo; Rodolfo Pastelin-Palacios; Nuriban Valero-Pacheco; Marisol Pérez-Toledo; Eduardo Ferat-Osorio; Fortino Solórzano-Santos; Guillermo Vázquez-Rosales; Clara Espitia-Pinzón; Irma Zamudio-Lugo; Abigail Meza-Chávez; Paul Klenerman; Armando Isibasi; Constantino López-Macías
Journal: J Infect Dev Ctries Date: 2015-03-15 Impact factor: 0.968

4. Calcitonin levels are similar in goitrous euthyroid patients with or without thyroid antibodies, as well as in hypothyroid patients.

Authors: H Pantazi; P D Papapetrou
Journal: Eur J Endocrinol Date: 1998-05 Impact factor: 6.664

5. Responses to a fourth dose of Haemophilus influenzae type B conjugate vaccine in early life.

Authors: M H Slack; D Schapira; R J Thwaites; M Burrage; J Southern; D Goldblatt; E Miller
Journal: Arch Dis Child Fetal Neonatal Ed Date: 2004-05 Impact factor: 5.747

6. Generation and characterization of a cold-adapted influenza A H9N2 reassortant as a live pandemic influenza virus vaccine candidate.

Authors: H Chen; Y Matsuoka; David Swayne; Q Chen; N J Cox; B R Murphy; K Subbarao
Journal: Vaccine Date: 2003-10-01 Impact factor: 3.641

7. Chimpanzee adenovirus- and MVA-vectored respiratory syncytial virus vaccine is safe and immunogenic in adults.

Authors: Christopher A Green; Elisa Scarselli; Charles J Sande; Amber J Thompson; Catherine M de Lara; Kathryn S Taylor; Kathryn Haworth; Mariarosaria Del Sorbo; Brian Angus; Loredana Siani; Stefania Di Marco; Cinzia Traboni; Antonella Folgori; Stefano Colloca; Stefania Capone; Alessandra Vitelli; Riccardo Cortese; Paul Klenerman; Alfredo Nicosia; Andrew J Pollard
Journal: Sci Transl Med Date: 2015-08-12 Impact factor: 17.956

8. Maternal interpersonal trauma and cord blood IgE levels in an inner-city cohort: a life-course perspective.

Authors: Michelle Judith Sternthal; Michelle Bosquet Enlow; Sheldon Cohen; Marina Jacobson Canner; John Staudenmayer; Kathy Tsang; Rosalind J Wright
Journal: J Allergy Clin Immunol Date: 2009-09-12 Impact factor: 10.793

9. Rapid seeding of the viral reservoir prior to SIV viraemia in rhesus monkeys.

Authors: James B Whitney; Alison L Hill; Srisowmya Sanisetty; Pablo Penaloza-MacMaster; Jinyan Liu; Mayuri Shetty; Lily Parenteau; Crystal Cabral; Jennifer Shields; Stephen Blackmore; Jeffrey Y Smith; Amanda L Brinkman; Lauren E Peter; Sheeba I Mathew; Kaitlin M Smith; Erica N Borducchi; Daniel I S Rosenbloom; Mark G Lewis; Jillian Hattersley; Bei Li; Joseph Hesselgesser; Romas Geleziunas; Merlin L Robb; Jerome H Kim; Nelson L Michael; Dan H Barouch
Journal: Nature Date: 2014-07-20 Impact factor: 49.962

10. Significant association between Helicobacter pylori infection and serum C-reactive protein.

Authors: Yoshiko Ishida; Koji Suzuki; Kentaro Taki; Toshimitsu Niwa; Shozo Kurotsuchi; Hisao Ando; Akira Iwase; Kazuko Nishio; Kenji Wakai; Yoshinori Ito; Nobuyuki Hamajima
Journal: Int J Med Sci Date: 2008-07-24 Impact factor: 3.738