Fredrik A Dahl, Margreth Grotle, Jūrate Saltyte Benth, Bård Natvig.
Abstract
There is growing concern in the scientific community that many published scientific findings may represent spurious patterns that are not reproducible in independent data sets. A reason for this is that significance levels or confidence intervals are often applied to secondary variables or sub-samples within the trial, in addition to the primary hypotheses (multiple hypotheses). This problem is likely to be extensive for population-based surveys, in which epidemiological hypotheses are derived after seeing the data set (hypothesis fishing). We recommend a data-splitting procedure to counteract this methodological problem, in which one part of the data set is used for identifying hypotheses, and the other is used for hypothesis testing. The procedure is similar to two-stage analysis of microarray data. We illustrate the process using a real data set related to predictors of low back pain at 14-year follow-up in a population initially free of low back pain. "Widespreadness" of pain (pain reported in several other places than the low back) was a statistically significant predictor, while smoking was not, despite its strong association with low back pain in the first half of the data set. We argue that the application of data splitting, in which an independent party handles the data set, will achieve for epidemiological surveys what pre-registration has done for clinical studies.
Year: 2008 PMID: 18288577 PMCID: PMC2270357 DOI: 10.1007/s10654-008-9230-x
Source DB: PubMed Journal: Eur J Epidemiol ISSN: 0393-2990 Impact factor: 8.082
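The data-splitting procedure recommended in the abstract can be illustrated with a minimal sketch. It assumes a random 50/50 split of record identifiers with a fixed seed; the function name `split_dataset`, the seed, and the split fraction are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def split_dataset(record_ids, seed=0, fraction_part1=0.5):
    """Randomly split record IDs into two disjoint parts.

    Part 1 is meant for exploring the data and identifying hypotheses.
    Part 2 is held back and used only to test those hypotheses.
    The seed and the 50/50 fraction are illustrative assumptions.
    """
    ids = list(record_ids)
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(ids)
    cut = int(len(ids) * fraction_part1)
    part1 = sorted(ids[:cut])  # exploration half
    part2 = sorted(ids[cut:])  # confirmation half
    return part1, part2


# Example with ten hypothetical record IDs
part1, part2 = split_dataset(range(1, 11), seed=42)
print("Part 1 (hypothesis identification):", part1)
print("Part 2 (hypothesis testing):", part2)
```

In the setting the abstract describes, an independent party would run a split like this and release Part 2 only after the hypotheses derived from Part 1 have been fixed.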
Parameter estimates from Part 2, controlling for age, gender, and marital status
| Predictor | OR estimate | 95% CI for OR | P-value |
|---|---|---|---|
| Number of pain sites^a | | | 0.015 |
| 1 or 2 pain sites | 1.328 | (0.793–2.224) | 0.281 |
| 3 or 4 pain sites | 1.598 | (0.857–2.979) | 0.141 |
| 5 or more pain sites | 3.941 | (1.700–9.136) | 0.001 |
| Smoking | 0.993 | (0.627–1.571) | 0.487* |
^a The reference category for number of pain sites was no pain sites
* 1-sided P-value
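As a rough consistency check on the rows for the pain-site categories, a two-sided P-value can be recovered from an odds ratio and its 95% confidence interval. The sketch below assumes the intervals are Wald intervals on the log-odds scale, exp(b ± 1.96·SE); the paper excerpt does not state this, and the helper name `wald_p_from_ci` is hypothetical.

```python
import math


def wald_p_from_ci(or_est, ci_low, ci_high, z_crit=1.959964):
    """Approximate a two-sided Wald P-value from an OR and its 95% CI.

    Assumes the CI was built as exp(b +/- z_crit * SE) on the log-odds
    scale, which is an assumption and not stated in the source.
    """
    # The width of the CI on the log scale equals 2 * z_crit * SE
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(or_est) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# Two pain-site rows as reported for Part 2
for label, or_est, lo, hi in [
    ("1 or 2 pain sites", 1.328, 0.793, 2.224),
    ("5 or more pain sites", 3.941, 1.700, 9.136),
]:
    z, p = wald_p_from_ci(or_est, lo, hi)
    print(f"{label}: z = {z:.2f}, two-sided P = {p:.3f}")
```

Under that assumption, the recomputed values come out at roughly 0.28 and 0.001, in line with the tabulated P-values for these two rows.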
Parameter estimates for the hypothesis variable, number of pain sites, from the complete data set, controlling for age, gender, marital status, and smoking status
| Predictor | OR estimate | 95% CI for OR | P-value |
|---|---|---|---|
| Number of pain sites^a | | | <0.001 |
| 1 or 2 pain sites | 1.637 | (1.116–2.400) | 0.012 |
| 3 or 4 pain sites | 1.983 | (1.285–3.061) | 0.002 |
| 5 or more pain sites | 3.346 | (1.846–6.067) | <0.001 |
^a The reference category for number of pain sites was no pain sites
Parameter estimates from Part 1, controlling for age, gender, and marital status
| Predictor | OR estimate | 95% CI for OR | P-value |
|---|---|---|---|
| Number of pain sites^a | | | 0.012 |
| 1 or 2 pain sites | 2.292 | (1.248–4.208) | 0.007 |
| 3 or 4 pain sites | 2.690 | (1.406–5.147) | 0.003 |
| 5 or more pain sites | 2.944 | (1.193–7.262) | 0.019 |
| Smoking | 2.079 | (1.285–3.363) | 0.003 |
^a The reference category for number of pain sites was no pain sites