Antony M Overstall, Ruth King, Sheila M Bird, Sharon J Hutchinson, Gordon Hay.
Abstract
Estimating the size of hidden or difficult to reach populations is often of interest for economic, sociological or public health reasons. To estimate such populations, administrative data lists are often collated to form multi-list cross-counts and displayed in the form of an incomplete contingency table. Log-linear models are typically fitted to such data to obtain an estimate of the total population size by estimating the number of individuals not observed by any of the data-sources. This approach has been taken to estimate the current number of people who inject drugs (PWID) in Scotland, with the Hepatitis C virus diagnosis database used as one of the data-sources to identify PWID. However, the Hepatitis C virus diagnosis data-source does not distinguish between current and former PWID, which, if ignored, will lead to overestimation of the total population size of current PWID. We extend the standard model-fitting approach to allow for a data-source that contains a mixture of target and non-target individuals (i.e. in this case, current and former PWID). We apply the proposed approach to data for PWID in Scotland in 2003, 2006 and 2009 and compare with the results from standard log-linear models.
Keywords: censoring; incomplete contingency table; log-linear models; people who inject drugs; population size
Year: 2013 PMID: 24293386 PMCID: PMC4285225 DOI: 10.1002/sim.6047
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
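The log-linear idea described in the abstract can be illustrated in its simplest form. This is a minimal sketch assuming only two data-sources and no interaction between them; the paper itself uses four sources, interaction terms and censoring, none of which is reproduced here. Under this independence model the three observed cells determine the three parameters exactly, and the unobserved cell follows in closed form. The function name and the counts are hypothetical.

```python
import math

def two_list_loglinear(n11, n10, n01):
    """Fit log(mu_ij) = theta + a*i + b*j to the three observed cells of a
    2x2 incomplete table and predict the unobserved cell mu_00.

    n11: seen by both lists, n10: list 1 only, n01: list 2 only.
    With three cells and three parameters the fit is exact.
    """
    if min(n11, n10, n01) <= 0:
        raise ValueError("all observed cells must be positive")
    theta = math.log(n10) + math.log(n01) - math.log(n11)  # log of mu_00
    a = math.log(n11) - math.log(n01)  # main effect of list 1
    b = math.log(n11) - math.log(n10)  # main effect of list 2
    n00 = math.exp(theta)              # estimated number seen by neither list
    total = n11 + n10 + n01 + n00      # estimated total population size
    return {"theta": theta, "a": a, "b": b, "n00": n00, "N": total}

# Hypothetical counts, not taken from the paper.
fit = two_list_loglinear(n11=50, n10=150, n01=200)
print(round(fit["n00"]), round(fit["N"]))  # 600 1000
```

With more than two lists, interaction terms become identifiable and the fit is no longer available in closed form, which is where model fitting and model selection come in.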
Summary statistics for the data for years 2003, 2006 and 2009 showing the number of PWID observed in total and by each data-source.
| Year | Total | S1 (social enquiry reports) | S2 (hospital records) | S3 (SDMD) | S4 (HCV diagnosis) |
|---|---|---|---|---|---|
| 2003 | 7201 | 1431 | 688 | 5151 | 761 |
| 2006 | 5670 | 901 | 953 | 3504 | 827 |
| 2009 | 4967 | 831 | 779 | 2946 | 888 |
SDMD, Scottish drug misuse database; HCV, Hepatitis C virus.
True values of the model parameters for the two different true data-generating models considered in the simulation study in Section 2.5.
| Term | Parameter | Value under model (i) | Value under model (ii) |
|---|---|---|---|
| S1 | | −0.75 | −0.75 |
| S2 | | −0.75 | −0.75 |
| S3 | | −0.75 | −0.75 |
| S4 | | −0.75 | −0.75 |
| S1 : S2 | | 0.25 | 0.25 |
| S1 : S3 | | 0.00 | 0.00 |
| S1 : S4 | | 0.25 | 0.00 |
| S2 : S3 | | 0.00 | 0.00 |
| S2 : S4 | | 0.25 | 0.00 |
| S3 : S4 | | 0.00 | 0.00 |
Model (i) has three non-zero interactions and model (ii) has one.
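To see what these parameter values imply, the sketch below turns them into probabilities for the 16 possible capture histories. It rests on assumptions that this excerpt does not confirm: a 0/1 (corner-point) coding of each source, a log-linear predictor made of main effects and two-way interactions only, and no censoring of S4. A sum-to-zero coding would give different numbers. The intercept is not needed because it cancels when the cell weights are normalised. All names are hypothetical.

```python
import itertools
import math

MAIN = -0.75  # main effect of each of S1..S4, from the table above

# Non-zero two-way interactions, indexed by 0-based source positions.
INTERACTIONS = {
    "i": {(0, 1): 0.25, (0, 3): 0.25, (1, 3): 0.25},  # S1:S2, S1:S4, S2:S4
    "ii": {(0, 1): 0.25},                             # S1:S2
}

def cell_probabilities(model):
    """Return {capture history: probability} for all 2**4 cells.

    ASSUMPTION: 0/1 coding, so a term contributes only when every
    source involved has observed the individual.
    """
    weights = {}
    for cell in itertools.product((0, 1), repeat=4):
        eta = MAIN * sum(cell)
        for (j, k), value in INTERACTIONS[model].items():
            eta += value * cell[j] * cell[k]
        weights[cell] = math.exp(eta)
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}

for model in ("i", "ii"):
    probs = cell_probabilities(model)
    # Probability of being missed by all four sources under this coding.
    print(model, round(probs[(0, 0, 0, 0)], 3))
```

Multiplying the probability of the all-zero history by a true total population size gives the expected number of unobserved individuals under these assumptions.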
Coverage rates and mean lengths (relative to the INC-C method) of 95% HPDIs from the simulation study in Section 2.5 for the total population size, N, and the proportion, φ, of individuals observed by the S4 data-source who are members of the target population.
| True total population size | Method | Coverage rate (%), model (i), N | Coverage rate (%), model (i), φ | Coverage rate (%), model (ii), N | Coverage rate (%), model (ii), φ | Relative mean length, model (i), N | Relative mean length, model (i), φ | Relative mean length, model (ii), N | Relative mean length, model (ii), φ |
|---|---|---|---|---|---|---|---|---|---|
| 10 000 | INC-C | 93 | 93 | 93 | 95 | 1.00 | 1.00 | 1.00 | 1.00 |
| | REM-C | 93 | NA | 94 | NA | 1.00 | NA | 1.02 | NA |
| | IGN-C | 35 | NA | 2 | NA | 1.37 | NA | 1.06 | NA |
| | MY-C | 90 | 88 | 93 | 92 | 1.02 | 0.98 | 1.12 | 1.03 |
| 15 000 | INC-C | 92 | 94 | 94 | 96 | 1.00 | 1.00 | 1.00 | 1.00 |
| | REM-C | 92 | NA | 94 | NA | 1.00 | NA | 1.01 | NA |
| | IGN-C | 26 | NA | 2 | NA | 1.49 | NA | 1.04 | NA |
| | MY-C | 89 | 87 | 92 | 91 | 1.03 | 1.01 | 1.13 | 1.06 |
| 20 000 | INC-C | 93 | 96 | 95 | 96 | 1.00 | 1.00 | 1.00 | 1.00 |
| | REM-C | 94 | NA | 94 | NA | 1.00 | NA | 1.02 | NA |
| | IGN-C | 21 | NA | 1 | NA | 1.57 | NA | 1.03 | NA |
| | MY-C | 89 | 87 | 92 | 91 | 1.06 | 1.01 | 1.18 | 1.09 |
The coverage rates and relative mean lengths are given for each true total population size (10 000, 15 000 and 20 000), each true data-generating model (non-zero interactions for (i) S1:S2, S1:S4 and S2:S4 and (ii) S1:S2) and each of the four methods. Coverage rate refers to the proportion of intervals that contain the true value of the parameter. The mean length refers to the mean of the difference between the lower and upper bounds of each interval. An entry of NA indicates that an estimate of this parameter is not available under this method.
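The two summary measures defined above are simple to compute once the intervals from each simulated data set are available. The following is a minimal sketch of those definitions; the function names and the example intervals are hypothetical and are not results from the paper.

```python
def coverage_rate(intervals, true_value):
    """Percentage of (lower, upper) intervals that contain true_value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return 100.0 * hits / len(intervals)

def mean_length(intervals):
    """Mean of upper minus lower over all intervals."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)

def relative_mean_length(intervals, reference_intervals):
    """Mean length relative to a reference method (INC-C in the table)."""
    return mean_length(intervals) / mean_length(reference_intervals)

# Hypothetical intervals for a true population size of 10 000.
reference = [(9000, 11000), (9500, 11500), (10100, 12100), (8800, 10800)]
other = [(9000, 12000), (9500, 12500), (10100, 13100), (8800, 11800)]

print(coverage_rate(reference, 10000))         # 75.0 (3 of 4 intervals)
print(relative_mean_length(other, reference))  # 1.5
```

A method with a relative mean length above 1 produces wider intervals on average than the reference method, so high coverage and a relative length close to 1 are both desirable.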
Posterior mean and 95% highest posterior density intervals for the total PWID population size in Scotland (to nearest 100) for the years 2003, 2006 and 2009 using the INC-C, REM-C and IGN-C methods.
| Year | INC-C posterior mean | INC-C 95% HPDI | REM-C posterior mean | REM-C 95% HPDI | IGN-C posterior mean | IGN-C 95% HPDI |
|---|---|---|---|---|---|---|
| 2003 | 16 700 | (14 300, 20 900) | 16 500 | (14 200, 20 800) | 27 500 | (20 700, 32 300) |
| 2006 | 22 900 | (16 300, 27 000) | 24 000 | (19 500, 29 700) | 31 000 | (24 600, 37 700) |
| 2009 | 15 200 | (11 500, 18 600) | 16 000 | (11 500, 19 400) | 31 000 | (24 000, 38 900) |
HPDI, highest posterior density interval.
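For readers unfamiliar with HPDIs, one common way to approximate a highest posterior density interval from posterior draws is to take the shortest interval that contains the required share of the sorted sample. The sketch below shows that generic approach; it assumes a unimodal posterior, and this excerpt does not say how the intervals in the table were computed. The function name and the simulated draws are hypothetical.

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples.

    ASSUMPTION: the posterior is unimodal, so a single interval suffices.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws the interval must cover
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Hypothetical right-skewed draws standing in for a posterior sample of N.
random.seed(1)
draws = [15000 + random.expovariate(1 / 2000) for _ in range(5000)]
lower, upper = hpd_interval(draws)
print(round(lower, -2), round(upper, -2))
```

For a skewed posterior the HPDI is shorter than the equal-tailed interval with the same mass, which is why the posterior mean need not sit at the centre of the interval.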
Figure 1. Plots of the posterior density for the total number of people who inject drugs (PWID) in each year for the INC-C, REM-C and IGN-C methods.
Posterior mean and 95% highest posterior density intervals for the total population size (for INC-C, REM-C and IGN-C) and the proportion of individuals observed by the Hepatitis C virus diagnosis data-source who are current people who inject drugs (for INC-C). These posterior estimates are given for each year and for each age group.
| Year | INC-C proportion, posterior mean (95% HPDI) | INC-C total population, posterior mean | INC-C total population, 95% HPDI | REM-C total population, posterior mean | REM-C total population, 95% HPDI | IGN-C total population, posterior mean | IGN-C total population, 95% HPDI |
|---|---|---|---|---|---|---|---|
| < 35 years | |||||||
| 2003 | 0.62 (0.49, 0.78) | 12800 | (10900, 16000) | 12700 | (10800, 16000) | 19900 | (15100, 23300) |
| 2006 | 0.68 (0.50, 0.81) | 15100 | (10800, 17900) | 15900 | (12700, 19600) | 18900 | (15000, 23100) |
| 2009 | 0.58 (0.41, 0.73) | 10400 | (7800, 12800) | 11000 | (7800, 13400) | 17000 | (13100, 21400) |
| 35+ years | |||||||
| 2003 | 0.46 (0.32, 0.63) | 3800 | (2800, 5300) | 3800 | (2700, 5200) | 7600 | (5500, 9200) |
| 2006 | 0.51 (0.34, 0.68) | 7800 | (5300, 10100) | 8100 | (6000, 10600) | 12000 | (9000, 15200) |
| 2009 | 0.22 (0.13, 0.31) | 4800 | (3600, 6000) | 5100 | (3600, 6200) | 14000 | (10200, 17900) |
HPDI, highest posterior density interval.
The marginal posterior probability for each two-way log-linear interaction term for the INC-C, REM-C and IGN-C methods for the years 2003, 2006 and 2009.
| Interaction | 2003 INC-C | 2003 REM-C | 2003 IGN-C | 2006 INC-C | 2006 REM-C | 2006 IGN-C | 2009 INC-C | 2009 REM-C | 2009 IGN-C |
|---|---|---|---|---|---|---|---|---|---|
| S1 × S2 | 0.10 | 0.10 | 0.92 | 0.14 | 0.10 | 0.08 | 0.11 | 0.16 | 0.99 |
| S1 × S3 | 0.80 | 0.83 | 0.85 | 0.94 | 0.99 | 1.00 | 0.62 | 0.79 | 1.00 |
| S1 × S4 | 0.09 | NA | 0.15 | 0.12 | NA | 0.08 | 0.33 | NA | 0.14 |
| S2 × S3 | 0.24 | 0.22 | 1.00 | 0.11 | 0.11 | 0.65 | 0.40 | 0.27 | 0.99 |
| S2 × S4 | 1.00 | NA | 1.00 | 1.00 | NA | 1.00 | 0.94 | NA | 0.57 |
| S3 × S4 | 0.11 | NA | 0.17 | 0.17 | NA | 0.09 | 0.12 | NA | 0.12 |
| S1 × Age | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.69 | 0.67 | 1.00 |
| S2 × Age | 0.44 | 0.48 | 0.07 | 0.99 | 0.98 | 0.92 | 1.00 | 1.00 | 1.00 |
| S3 × Age | 0.80 | 0.76 | 1.00 | 1.00 | 1.00 | 1.00 | 0.16 | 0.16 | 1.00 |
| S4 × Age | 0.09 | NA | 0.12 | 0.07 | NA | 0.15 | 0.32 | NA | 0.22 |
| S1 × Sex | 1.00 | 1.00 | 1.00 | 0.08 | 0.08 | 0.13 | 0.07 | 0.07 | 0.13 |
| S2 × Sex | 0.07 | 0.07 | 0.06 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| S3 × Sex | 0.08 | 0.08 | 0.07 | 0.23 | 0.05 | 0.06 | 0.04 | 0.04 | 0.04 |
| S4 × Sex | 0.11 | NA | 0.06 | 0.08 | NA | 0.13 | 0.07 | NA | 0.11 |
| S1 × Region | 0.93 | 0.93 | 0.98 | 0.21 | 0.20 | 0.47 | 0.07 | 0.07 | 0.20 |
| S2 × Region | 1.00 | 1.00 | 1.00 | 0.05 | 0.06 | 0.27 | 0.04 | 0.05 | 0.14 |
| S3 × Region | 0.06 | 0.07 | 0.07 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| S4 × Region | 0.98 | NA | 1.00 | 0.09 | NA | 0.82 | 0.17 | NA | 1.00 |
| Age × Sex | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Age × Region | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Sex × Region | 0.04 | 0.04 | 0.02 | 0.04 | 0.04 | 0.04 | 0.03 | 0.04 | 0.03 |
The data-sources are labelled as S1, social enquiry reports; S2, hospital records; S3, Scottish drug misuse database (SDMD); and S4, HCV diagnosis data-source. An NA indicates that this interaction cannot be identified with the REM-C method.
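A marginal posterior probability for an interaction term is the total posterior probability of all models that include that term. As a minimal sketch, assume the posterior over models is represented by a list of sampled models, each given as the set of interaction terms it contains; this excerpt does not describe the sampler, and the function name and draws below are hypothetical. The marginal probability is then estimated by the share of draws containing the term.

```python
from collections import Counter

def inclusion_probabilities(sampled_models):
    """Estimate marginal posterior inclusion probabilities.

    sampled_models: list of sets of interaction labels, one set per
    posterior draw of the model. Terms that never appear are omitted.
    """
    if not sampled_models:
        raise ValueError("sampled_models must not be empty")
    counts = Counter(term for model in sampled_models for term in set(model))
    n = len(sampled_models)
    return {term: counts[term] / n for term in sorted(counts)}

# Hypothetical draws, not taken from the paper.
draws = [
    {"S1:S3", "S2:S4"},
    {"S2:S4"},
    {"S1:S3", "S2:S4", "Age:Sex"},
    {"S2:S4", "Age:Sex"},
]
print(inclusion_probabilities(draws))
# {'Age:Sex': 0.5, 'S1:S3': 0.5, 'S2:S4': 1.0}
```

Values close to 1 in the table therefore indicate interactions present in almost every model with appreciable posterior support, and values close to 0 indicate interactions that are rarely included.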