Eckhard Limpert, Werner A Stahel.
Abstract
BACKGROUND: The Gaussian or normal distribution is the most established model to characterize quantitative variation of original data. Accordingly, data are summarized using the arithmetic mean and the standard deviation, as mean ± SD, or with the standard error of the mean, as mean ± SEM. This, together with the corresponding bars in graphical displays, has become the standard way to characterize variation.
Year: 2011 PMID: 21779325 PMCID: PMC3136454 DOI: 10.1371/journal.pone.0021403
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Misleading characterization of data.
| Discipline / Character | Case | | | 95% ( | Reference |
| a) Cases based on SD | | | | | |
| Risk factors | A- Insulin, pM | | | | |
| | B- Running capacity, m | | | | |
| Genetics | C- KAP1, Mest, % tot. input | | | | |
| Cytology | D- Exon expres., leukocytes | | | | |
| Phytopathology | E- Fungic. sensitivity, mg l⁻¹ | | | | |
| b) Cases based on SEM | | | | | |
| | F- Cells/ml, × 10⁶ | 0.25± | | | |
| Tumorigenesis | G- Microadenomas | 2.06± | | | |
| | H- Cell density | 6000± | | | |
| Deforestation | I- Calc. P, kg/ha | 62± | | | |
| Honey | J- HMF-content, mg/kg | 10.1±0.3 (1573) | 10.1± | | after |
a, Variation in data from across the sciences is frequently characterized with the arithmetic mean and the standard deviation, SD. Often, the numbers alone show that the data must be skewed: the lower end of the 95% interval of normal variation, mean − 2 SD, extends below zero and thus fails the “95% range check”, as is the case for all cited examples. Values in bold contradict the positive nature of the data.
b, More often, variation is described with the standard error of the mean, SEM (SD = SEM · √n, with n = sample size). Such distributions are often even more skewed, and their original characterization as symmetric is even more misleading. Original values are given in italics (°estimated from graphs). Most often, each cited reference contains several examples in addition to the case(s) considered here. Table 2 collects further examples.
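The "95% range check" described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the reading of the honey entry "10.1±0.3 (1573)" as mean ± SEM with n = 1573 is an assumption based on its position under "Cases based on SEM".

```python
from __future__ import annotations

import math


def sd_from_sem(sem: float, n: int) -> float:
    """Recover the standard deviation from the SEM via SD = SEM * sqrt(n)."""
    return sem * math.sqrt(n)


def normal_95_range(mean: float, sd: float) -> tuple[float, float]:
    """Approximate 95% range under a normal model: mean - 2 SD to mean + 2 SD."""
    return mean - 2.0 * sd, mean + 2.0 * sd


def passes_range_check(mean: float, sd: float) -> bool:
    """For strictly positive data, the check fails if mean - 2 SD is below zero."""
    lower, _ = normal_95_range(mean, sd)
    return lower >= 0.0


if __name__ == "__main__":
    # Case J (HMF content in honey), read here as mean = 10.1, SEM = 0.3, n = 1573.
    sd = sd_from_sem(0.3, 1573)
    lower, upper = normal_95_range(10.1, sd)
    print(f"SD = {sd:.2f}, normal 95% range = {lower:.2f} to {upper:.2f}")
    print("passes 95% range check:", passes_range_check(10.1, sd))
```

A negative lower limit for a quantity that cannot be negative signals that the symmetric mean ± SD summary does not describe the data well.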
Table 2. Summarizing data – Problems and solutions.
| Discipline / Subject | Case, reference | Description, original | | 95% range | Description, recommended¹ | | 95% range |
| | Concentration of insulin, pM, | - | | | - | 256 x/1.71 | 87 to 753 |
| Health risk | Running capacity, m, | - | | | - | 608 x/1.7 | 210 to 1760 |
| | Insulin, 30 min, SD, | 1.5±1.34 | | | 1.12 x/1.41 | 1.12 x/2.15 | 0.242 to 5.18 |
| | Inflammation, histological score, Fig 3F, GP6 in | 1.69± | | | 1.091 x/1.21 | 1.091 x/2.55 | 0.17 to 7.1 |
| | Inflammation in mice, mRNA expr., Fig 7b in | | | | 4.33 x/1.65 | 4.33 x/2.4 | 0.76 to 24.5 |
| Tryptophan-catabolism | Kynurenine µM, | - | | | - | 0.32 x/1.95 | 0.0841 to 1.22 |
| Immune response | TNFα mRNA production, Fig. 4F, 0h, | - | | | - | 0.318 x/2.3 | 0.0602 to 1.68 |
| Tumorigenesis | Microadenomas, frequency, p 125, line –18, | 2.06± | | | 1.1 x/1.75 | 1.1 x/3.06 | 0.117 to 10.3 |
| | PCNA-positive cells, %, WT, 4 weeks, | 4.2 | | | 3.11 x/1.41 | 3.11 x/2.17 | 0.66 to 14.6 |
| | KAP1, Mest, % total input, Fig 3b, | - | | | - | 1.45 x/2.23 | 0.29 to 7.2 |
| Genetics | D- Exon expres., leukocytes, Fig 4A, above, | - | | | - | 11.64 x/2.07 | 2.72 to 49.8 |
| Cytology | Fus3ch concentration, nM, | - | | | | 142 x/2.25 | 28 to 718 |
| | Number of cells/ml × 10⁶, Fig. 7, E2, | 0.25± | | | 0.167 x/1.68 | 0.167 x/2.45 | 0.028 to 1.0 |
| Evolution | Living rotifers, no., after 3d, | 30± | | | 10.7 x/1.42 | 10.7 x/4.2 | 0.6 to 189 |
| Virology | Virus release, × 10³, | - | | | - | 33.9 x/1.78 | 10.8 to 107 |
| Neurology | Labeled gran. cells, %, | - | | | - | 3.39 x/1.78 | 1.1 to 10.7 |
| | Freezing kinet., %, Fig 4B, 30s, | 15± | | | 9.92 x/1.3 | 9.92 x/2.48 | 1.61 to 61.1 |
| | | 38± | | | 17 x/1.33 | 17 x/3.5 | 1.4 to 212 |
| Parasitology | Luciferase +activity, × 10⁶, Fig 4e, | - | | | - | 156 x/2.17 | 33.2 to 731 |
| Ontogeny | Cell surv. with gremlin, Fig 3C, CFU-M, | - | | | - | 19.3 x/1.67 | 6.96 to 53.6 |
| Photosynthesis | Nitrite cons., mM, | - | | | - | 0.111 x/2.96 | 0.013 to 0.973 |
| Signal transduction | Fluorescence, | 3±22 | | | 0.405 x/1.71 | 0.405 x/7.4 | 0.0074 to 22.2 |
| Fertility, in mice | Ovulated oocytes/CD9+/+mice, Tab.1, | | | | 8.46 x/1.28 | 8.46 x/4.87 | 0.357 to 200 |
| in plants | Transcript quantity, | 2.5± | | | 1.73 x/1.64 | 1.73 x/2.35 | 0.313 to 9.6 |
| Quiescence | Latency, s, p 571-left, line 23, | 7± | | | 4.61 x/1.27 | 4.61 x/2.49 | 0.741 to 28.7 |
| | Bacteria in rhizosphere, 15d × 10³, | 55±41.1 | | | 44.1 x/1.23 | 44.1 x/1.95 | 11.6 to 167 |
| Cell counts | | 61± | | | 20.6 x/1.68 | 20.6 x/4.36 | 1.08 to 392 |
| Fungicide sensitivity | Botrytis cinerea – triadimenol, µg ml⁻¹, p 173, | | | | - | 3.04 x/2.16 | 0.65 to 14.3 |
| | Wheat p. mildew – fenpropimorph, mg l⁻¹, | | | | 17.2 x/1.09 | 17.2 x/2.38 | 3.04 to 97.1 |
| | Colony forming units per m³ air × 10⁶, | | | | - | 438 x/2.13 | 96.7 to 1981 |
| | | | | | - | 2424 x/1.82 | 734 to 8008 |
| | Data indicated at log-scale, Fig. 4, 2. col., | 6000± | | | 3712 x/1.76 | 3712 x/2.66 | 523 to 26355 |
| Nitrate in foraminifers | Boliv. subaen., Bay of B., pmol per cell, | 285± | | | 191 x/1.14 | 191 x/2.45 | 32 to 1143 |
| | Deforestat. Calc. Pi, kg/ha, | 62± | | | 37.1 x/1.80 | 37.1 x/2.75 | 4.89 to 282 |
| | Reynolds stress, β, × 10⁻⁶, Fig.4, bottom right, | | | | - | 0.0707 x/7.23 | 0.0014 to 3.69 |
| | HMF-content in honey, mg/kg, after | 10.1±0.3 (1573) | 10.1± | | | | 1.03 to 42 |

¹ These results were calculated, starting from
The collection of datasets in Table 1 is extended here, together with their more meaningful and therefore recommended descriptions, which are based on multiplicative means and multiplicative standard errors or standard deviations. Some comparisons are of interest. Arithmetic means necessarily exceed multiplicative ones, by some 15% for small values of s* around 1.7 and by more than sevenfold for s* > 7. In the classical description, the lower limits of the 95% ranges, relative to the means, turn increasingly negative as s* grows; in the multiplicative description they remain positive and get smaller. Turning to the upper limits, the multiplicative limit exceeds the additive one by some 17% for s* = 1.7 and by about 25% for s* = 2.5. For s* = 4.2 there is no difference, and for s* = 7 the multiplicative limit is only about half the additive one.
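The comparisons in this caption can be reproduced approximately under one simplifying assumption: that the data follow an exact log-normal distribution, so that the arithmetic mean and SD follow from the multiplicative mean and s* in closed form. The sketch below uses hypothetical function names and sets the multiplicative mean to 1, so all results are relative to it.

```python
import math


def lognormal_mean_sd(s_star: float, mult_mean: float = 1.0):
    """Arithmetic mean and SD of an exact log-normal distribution.

    sigma = ln(s*) is the standard deviation of the log-transformed data.
    """
    sigma = math.log(s_star)
    mean = mult_mean * math.exp(sigma ** 2 / 2.0)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, sd


def compare_descriptions(s_star: float) -> dict:
    """Additive (mean -/+ 2 SD) versus multiplicative (x/ s*^2) 95% limits."""
    mean, sd = lognormal_mean_sd(s_star)
    return {
        "mean_ratio": mean,  # arithmetic mean / multiplicative mean
        "additive_lower": mean - 2.0 * sd,
        "additive_upper": mean + 2.0 * sd,
        "multiplicative_lower": 1.0 / s_star ** 2,
        "multiplicative_upper": s_star ** 2,
    }


if __name__ == "__main__":
    for s in (1.7, 2.5, 4.2, 7.0):
        c = compare_descriptions(s)
        ratio = c["multiplicative_upper"] / c["additive_upper"]
        print(f"s* = {s}: mean ratio {c['mean_ratio']:.2f}, "
              f"additive lower {c['additive_lower']:.2f}, "
              f"upper-limit ratio (mult./add.) {ratio:.2f}")
```

Real datasets are only approximately log-normal, so the ratios obtained this way are indicative rather than exact.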
Figure 1. Adequate characterization of data improves the results.
- a, b, The frequency distribution of a chemical (hydroxymethylfurfurol, HMF) in honey illustrates the problem and its solution. a, The normal density curve clearly does not fit this skewed dataset, but the log-normal does. b, The distribution is normal after logarithmic transformation and, thus, log-normal. Back-transforming the mean and SD from the level of the logarithms gives the multiplicative (or geometric) mean x̄* and the multiplicative standard deviation s*, which characterize variation at the original scale of the data (a).
- c, Comparison of the two types of (1 standard deviation) intervals for the datasets A-J shown in Table 1. The multiplicative intervals are clearly shorter, which increases the potential for differentiation. Moreover, they never lead to negative values and usually describe the observed variation well.
- d, e, Multiplicative intervals improve differentiation in an example from [20]. d, Original, additive description of variation, with two significant differences, marked *, and a third one close to significance. Error bars indicate SEM. e, The multiplicative type of intervals (based on the original, unpublished data received from the authors), shown here with a log scale on the vertical axis, gives a more plausible picture: all three differences become more significant, and one is now highly significant. Error bars indicate SEM*.
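The back-transformation described for panel b can be sketched as follows. The function names are hypothetical, and the sample SD (n − 1 denominator) of the logarithms is an assumption, since the caption does not say which estimator was used. The 95% range is taken as the multiplicative mean times and divided by s* squared, which matches the rows of Table 2.

```python
import math
import statistics


def multiplicative_summary(values):
    """Multiplicative (geometric) mean and multiplicative SD s* of positive data.

    Mean and SD are computed on the natural logarithms and then
    back-transformed with exp().
    """
    logs = [math.log(v) for v in values]
    mean_star = math.exp(statistics.fmean(logs))
    s_star = math.exp(statistics.stdev(logs))  # sample SD of the logs (assumption)
    return mean_star, s_star


def multiplicative_interval(mean_star, s_star, k=1):
    """Interval mean* x/ s*^k: k = 1 covers about 68%, k = 2 about 95%."""
    factor = s_star ** k
    return mean_star / factor, mean_star * factor


if __name__ == "__main__":
    # Table 2 row "Insulin, 30 min": 1.12 x/ 2.15, listed range 0.242 to 5.18.
    low, high = multiplicative_interval(1.12, 2.15, k=2)
    print(f"95% range: {low:.3f} to {high:.2f}")
```

Both interval limits are positive by construction, which is why the multiplicative description can never contradict the positive nature of the data.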
Figure 2. Savings in sample size.
If the t-test, which is based on the normal distribution, is applied to (skewed) raw data, its statistical power is lower than that of the optimal procedure, which applies the test to the log-transformed values. Starting from two groups of log-normal data with a given s*, we calculate the sample size needed in each group for the (inappropriate) t-test on the raw data to achieve the same (simulated) statistical power as the optimal test applied to n = 5, 10, and 50 observations in each group. This sample size is a function of s*. For the median skewness, s* = 2.4, 16 observations are needed instead of 10, corresponding to 60% additional effort.
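A rough sketch of this kind of power simulation is given below. It is not the authors' procedure: the effect size (a ratio of 3 between the group medians), the number of runs, the seed, and the use of a pooled-variance two-sample t-test with tabulated critical values are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t for df = 2n - 2,
# tabulated for the group sizes n = 5, 10 and 50 named in the caption.
T_CRIT = {5: 2.306, 10: 2.101, 50: 1.984}


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * statistics.variance(a)
              + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def simulated_power(n, s_star, ratio, use_logs, runs=2000, seed=1):
    """Fraction of simulated experiments in which the t-test rejects at 5%.

    Both groups are log-normal with the same s*; their medians differ by
    the factor `ratio` (an assumed effect size, not taken from the source).
    """
    rng = random.Random(seed)
    sigma = math.log(s_star)
    shift = math.log(ratio)
    hits = 0
    for _ in range(runs):
        a = [rng.gauss(0.0, sigma) for _ in range(n)]
        b = [rng.gauss(shift, sigma) for _ in range(n)]
        if not use_logs:
            # Back-transform to the raw, skewed scale before testing.
            a = [math.exp(x) for x in a]
            b = [math.exp(x) for x in b]
        if abs(t_statistic(a, b)) > T_CRIT[n]:
            hits += 1
    return hits / runs


if __name__ == "__main__":
    for use_logs in (True, False):
        power = simulated_power(10, 2.4, 3.0, use_logs)
        label = "log-transformed" if use_logs else "raw"
        print(f"t-test on {label} data, n = 10: power about {power:.2f}")
```

Because the same seed is used for both calls, the two tests see identical simulated datasets, so the difference in rejection rates reflects the scale of analysis rather than sampling noise between runs. Finding the raw-data sample size that matches the power of the log-scale test would require extending `T_CRIT` or computing t quantiles with a statistics library.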