Nico Nagelkerke, Vaclav Fidler.
Abstract
The problem of discrimination and classification is central to much of epidemiology. Here we consider the estimation of a logistic regression/discrimination function from training samples, when one of the training samples is subject to misclassification or mislabeling, e.g. diseased individuals are incorrectly classified/labeled as healthy controls. We show that this leads to a zero-inflated binomial model with a defective logistic regression or discrimination function, whose parameters can be estimated using standard statistical methods such as maximum likelihood. These parameters can be used to estimate the probability of true group membership among those, possibly erroneously, classified as controls. Two examples are analyzed and discussed. A simulation study explores properties of the maximum likelihood parameter estimates and the estimates of the number of mislabeled observations.
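
The abstract does not spell out the likelihood, so the following is only a minimal sketch of one way a defective logistic regression could be written down. It assumes (this is not stated in the record) that a true case is labeled as a control with probability `lam`, independently of the covariates, so that the probability of being labeled a case is `(1 - lam) * expit(b0 + b·x)`. All function and parameter names are hypothetical, and the authors' own parameterization may differ.

```python
import math

def expit(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_predictor(x, b0, b):
    """Intercept plus the dot product of coefficients and covariates."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x))

def labeled_case_prob(x, b0, b, lam):
    """P(labeled case | x) under the ASSUMED defective form.

    The curve levels off at 1 - lam instead of 1, which is what makes
    the logistic function 'defective'. lam = 0 gives ordinary logistic
    regression.
    """
    return (1.0 - lam) * expit(linear_predictor(x, b0, b))

def neg2_log_lik(rows, labels, b0, b, lam):
    """-2 log-likelihood of the observed labels (1 = case, 0 = control)."""
    total = 0.0
    for x, y in zip(rows, labels):
        q = labeled_case_prob(x, b0, b, lam)
        total += math.log(q) if y == 1 else math.log(1.0 - q)
    return -2.0 * total

def prob_true_case_given_control(x, b0, b, lam):
    """P(true case | labeled control, x) under the same assumption.

    Numerator: true cases that were mislabeled. Denominator: everyone
    labeled as a control (mislabeled cases plus true controls).
    """
    p = expit(linear_predictor(x, b0, b))
    return lam * p / (1.0 - (1.0 - lam) * p)

# Toy data, invented purely to show the calls; not from the paper.
rows = [(0.2,), (1.5,), (3.0,), (0.1,)]
labels = [0, 1, 1, 0]
dev_lr = neg2_log_lik(rows, labels, b0=-2.0, b=(1.5,), lam=0.0)
dev_dlr = neg2_log_lik(rows, labels, b0=-2.0, b=(1.5,), lam=0.2)
```

In practice `neg2_log_lik` would be handed to a numerical optimizer to obtain maximum likelihood estimates of `b0`, `b`, and `lam`; the sketch only evaluates it at fixed parameter values.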
Year: 2015 PMID: 26474313 PMCID: PMC4608588 DOI: 10.1371/journal.pone.0140718
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary statistics of the 4 genetic traits in the 49 pathogenic and 173 environmental Legionella strains.
| Trait | Environmental strains, mean (SD) | Pathogenic strains, mean (SD) |
|---|---|---|
| L07B8 | .42 (2.11) | 8.54 (8.24) |
| L15D6 | .34 (.43) | .81 (.56) |
| L16E4 | .99 (.34) | 1.30 (.17) |
| L33F8 | 1.13 (.29) | .83 (.44) |
Parameter estimates and their standard errors of DLR and LR estimation.
| Variable | DLR estimate | DLR SE | LR estimate | LR SE |
|---|---|---|---|---|
| μ = λ/(1-λ) | 0.180 | 0.108 | | |
| b0 | -5.255 | 2.240 | -4.200 | 1.325 |
| bL07B8 | 0.609 | 0.506 | 0.187 | 0.050 |
| bL15D6 | 0.984 | 0.567 | 0.984 | 0.412 |
| bL16E4 | 4.865 | 1.885 | 3.430 | 0.967 |
| bL33F8 | -2.566 | 1.087 | -2.060 | 0.663 |
| -2∙log(L) | 122.57 | | 125.67 | |
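
As a worked reading of the Example 1 estimates, the reported μ can be converted back to λ by inverting μ = λ/(1-λ), and the gap in -2∙log(L) between the DLR and LR fits can be computed directly. Treating that gap as a likelihood ratio statistic for λ = 0 is an interpretation added here, not a result stated in this record; because λ = 0 lies on the boundary of the parameter space, the usual chi-square reference with 1 degree of freedom may not apply as is.

```latex
\hat{\lambda} = \frac{\hat{\mu}}{1 + \hat{\mu}} = \frac{0.180}{1.180} \approx 0.153,
\qquad
\Delta\bigl(-2\log L\bigr) = 125.67 - 122.57 = 3.10
```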
Fig 1. Frequency distribution of predicted probabilities of being a case in Example 1.
Summary statistics of risk factors age, daily alcohol, and daily tobacco use, among esophagus cancer cases (n = 200) and controls (n = 776).
| Risk factor | Controls | Cases |
|---|---|---|
| Age (years), mean (SD) | 50.2 (14.3) | 60.0 (9.2) |
| Alcohol (g/day), mean (SD) | 44.3 (31.9) | 85.1 (48.5) |
| Tobacco (g/day), % of group | | |
| None | 32.9 | 4.5 |
| 1–4 | 11.2 | 9.0 |
| 5–9 | 13.7 | 25.5 |
| 10–14 | 14.3 | 20.0 |
| 15–19 | 8.6 | 9.0 |
| 20–29 | 12.8 | 16.5 |
| 30–39 | 3.1 | 8.0 |
| 40–49 | 3.1 | 6.5 |
| 50- | 0.4 | 1.0 |
Parameter estimates and their standard errors of DLR and LR estimation
| Variable | DLR estimate | DLR SE | LR estimate | LR SE |
|---|---|---|---|---|
| μ = λ/(1-λ) | 0.869 | 0.385 | | |
| b0 | -2.515 | 0.361 | -2.953 | 0.248 |
| b(age-50)/10 | 1.665 | 0.380 | 1.157 | 0.187 |
| b2 (age-50)/10 | -0.526 | 0.153 | -0.357 | 0.089 |
| btobacco | 0.330 | 0.093 | 0.235 | 0.053 |
| b(alcohol-50)/10 | 0.306 | 0.080 | 0.182 | 0.025 |
| -2∙log(L) | 589.23 | | 594.54 | |
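
The abstract mentions estimating the number of mislabeled observations. One hedged way to get a rough figure from the reported μ is sketched below. It assumes λ is the fraction of all true cases that ended up labeled as controls, in which case μ = λ/(1-λ) is the ratio of mislabeled cases to correctly labeled cases, and μ times the number of labeled cases approximates the number of mislabeled controls. That reading, and the helper name `implied_mislabeled`, are assumptions made here rather than definitions taken from the record. The sample sizes are those given in the two summary-table captions, with pathogenic strains treated as the cases in Example 1.

```python
def implied_mislabeled(mu, n_cases):
    """Approximate count of true cases hidden among the labeled controls.

    ASSUMPTION: mu is the ratio of mislabeled to correctly labeled cases.
    """
    return mu * n_cases

# Reported mu estimates and group sizes from the tables and captions.
examples = {
    "Example 1 (Legionella)": {"mu": 0.180, "n_cases": 49, "n_controls": 173},
    "Example 2 (esophagus cancer)": {"mu": 0.869, "n_cases": 200, "n_controls": 776},
}

for name, e in examples.items():
    hidden = implied_mislabeled(e["mu"], e["n_cases"])
    share = hidden / e["n_controls"]  # fraction of the labeled controls
    print(f"{name}: about {hidden:.1f} mislabeled, {share:.1%} of controls")
```

These are point values only; the standard errors reported for μ (0.108 and 0.385) are large relative to the estimates, so the implied counts are correspondingly uncertain.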
Fig 2. Frequency distribution of predicted probabilities of being a case in Example 2.
Fig 3. ROC curve for artificially mislabeled control group data of Example 2.