
Improving population health measurement in national household surveys: a simulation study of the sample design of the comprehensive survey of living conditions of the people on health and welfare in Japan.

Nayu Ikeda, Kenji Shibuya, Hideki Hashimoto.

Abstract

BACKGROUND: The Comprehensive Survey of Living Conditions of the People on Health and Welfare (CSLC) is a major source of health data in Japan. The CSLC is not strictly based on probabilistic sampling, but instead uses an equal allocation of sample clusters to yield equal standard errors of estimates across prefectures. This study compared the performance of this sample design in measuring population health with that of an alternative probabilistic sampling approach.
METHODS: A simulation analysis was conducted using hypothetical population data (n = 34 262 865) from which 1000 sample datasets were randomly drawn using 2 sampling methods, namely, a conventional stratified random sampling of a constant number of clusters and an alternative 2-stage cluster sampling of households with probability proportional to size. The root mean squared error was used to measure the accuracy of estimated means of a continuous variable and proportions of its dichotomized variable.
RESULTS: The alternative method reduced the variability of estimates in the total population and by strata. It improved further with an increased number of sample clusters in conjunction with a reduced sampling rate of households from selected clusters.
CONCLUSIONS: The alternative sample design increased the overall accuracy of population estimates of continuous and dichotomous variables from the CSLC. These benefits should be carefully weighed against the costs incurred in traveling to additional clusters in large prefectures. Further simulation research is necessary to investigate the performance of sampling designs for nominal and ordinal response variables.

Year:  2011        PMID: 21841351      PMCID: PMC3899438          DOI: 10.2188/jea.JE20100102

Source DB:  PubMed          Journal:  J Epidemiol        ISSN: 0917-5040            Impact factor:   3.211


INTRODUCTION

The Comprehensive Survey of Living Conditions of the People on Health and Welfare (CSLC) is a major source of data for tracking trends in population health and for the evaluation of health programs in Japan. The CSLC is a large-scale survey that is conducted every 3 years to provide information for the assessment of health outcomes at the subnational level of 47 prefectures, while small-scale surveys on the status of households and their income are implemented during the interim. In this large-scale survey, to ensure a sufficient sample size and equal errors of estimates across prefectures, a constant number of clusters is randomly selected from prefectures and designated cities with a population of more than 500 000.[1] For example, 100 clusters are sampled from each prefecture that does not have a designated city, so that target precision for total estimates of households remains approximately 2% to 3% across prefectures.[2] The clusters are census enumeration areas consisting of 50 households on average,[3] and all households in the sample clusters are asked to participate in the survey.[2]

The sample design of the CSLC raises 2 issues. First, under an equal allocation of sample clusters, the sample does not reflect the distribution of the total population because the population size substantially differs across prefectures. Thus, in the absence of appropriate adjustment, estimates of population parameters based on such samples may be subject to considerable sampling errors. Although the survey provides a ratio of a sample size to an estimated Japanese population in each prefecture as sample weights, these weights are only useful for expanding estimated totals of the number of households or household members from a sample to the subnational level, which is a primary purpose of the CSLC. The second problem of this sample design is that the confidentiality of personal information may be violated during the dissemination of secondary data for scientific research. 
Given that all households in selected clusters are included in a study sample, the possibility cannot be completely excluded that, without data masking, individuals or households might be identified from variables related to the sample design or any other identifying information in secondary data released for public use. A potential alternative approach to overcome these limitations in the current CSLC sample design is to use 2-stage cluster sampling of households with probability proportional to size. In theory, this sampling procedure allows a sample to be proportional to the distribution of the whole population and may also improve confidentiality, which would be advantageous for respondents. With appropriate sampling fractions, this alternative strategy might be able to maintain the original target sample size of each prefecture in the CSLC. However, it is not known how well this sampling approach compares with the conventional constant cluster sampling of the CSLC in the estimation of population values. This lack of evidence is partly attributable to the fact that population parameters are usually unknown. This study compared the statistical performance of the conventional and alternative sample designs by conducting a simulation study based on a hypothetical population. The major advantage of this simulation approach is that the known true values (ie, population means and variances) can be used as a benchmark for the assessment of the statistical performance of the sampling strategies. Previous studies have applied simulation techniques to investigate a number of important issues in medical statistics, epidemiology, and other fields.[4]–[7] We hope that the present analysis will provide a useful example of generating evidence for discussions of the establishment of a health information base through the redesign of national household health surveys in Japan.

METHODS

Population data

A dataset of a hypothetical population was created for the simulation analysis. The artificial population was intended to be approximately one fifth the size of the population of Japan. The population data had 10 strata, and the numbers of clusters, households, and individuals were generated by pseudorandom number generators with predetermined initial values and distributions.

The number of household members of the jth household in the ith cluster of the hth stratum, $N_{hij}$, followed the discrete uniform distribution on the integers between 1 and 6:

$$N_{hij} \sim \mathrm{U}\{1, 6\}$$

The number of households in the ith cluster of the hth stratum, $N_{hi}$, was distributed normally with a mean of 50 and a variance of 1:

$$N_{hi} \sim \mathrm{N}(50, 1)$$

The mean and variance of $N_{hi}$ were specified so that cluster sizes were consistent with the sizes of census enumeration areas. The number of clusters in the hth stratum, $N_{h}$, followed the discrete uniform distribution on the integers between 4000 and 40 000:

$$N_{h} \sim \mathrm{U}\{4000, 40\,000\}$$

The range of $N_{h}$ corresponded to that of the number of census enumeration areas by prefecture in the 2005 Population Census of Japan, which was the sampling frame of the 2007 CSLC.[2]

A continuous random variable X was created as a benchmark for assessing the statistical performance of the sample designs. The idea for this variable originated from systolic blood pressure in millimeters of mercury. A normal distribution was assumed in generating pseudorandom numbers for X with different means and variations across strata, clusters, and households. X was assigned to the kth individual of the jth household in the ith cluster of the hth stratum as

$$X_{hijk} \sim \mathrm{N}(\mu_{hij}, \sigma^2_{hij})$$

where $\mu_{hij}$ was a household mean of X and $\sigma^2_{hij}$ was a variance of X across individuals within households. These 2 parameters at the household level were given as

[equation missing from the source text]

where $\mu_{hi}$ and $\sigma^2_{hi}$ signify a mean and variance, respectively, of household means of X within clusters. The 2 parameters at the cluster level were generated as

[equation missing from the source text]

where $\mu_{h}$ and $\sigma^2_{h}$ denote a mean and variance, respectively, of cluster means of X within strata. These 2 parameters at the stratum level were specified as

[equation missing from the source text]

All the specific numbers above were arbitrarily defined, except for the mean and standard deviation of $\mu_{h}$ across strata, which reflect distributions of systolic blood pressures estimated from the National Health and Nutrition Surveys.[8] As part of our attempt to investigate the performance of the sampling designs for categorical variables, the continuous X was further dichotomized to create a binary variable that indicated 1 for individuals having X equal to or greater than 140 and 0 for all other individuals.
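The nested generation scheme described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' code: the function name `make_population`, the four spread constants, the centre of 130 for the stratum means, and the reduced cluster range (40 to 400 instead of 4000 to 40 000, for speed) are all assumptions, because the parameter values used in the study are not preserved in this text.

```python
import numpy as np

# ASSUMED standard deviations at each level; the study's actual values
# are not available in the text.
SD_STRATUM, SD_CLUSTER, SD_HOUSEHOLD, SD_INDIVIDUAL = 2.5, 3.0, 5.0, 12.0

def make_population(n_strata=10, cluster_range=(40, 400), seed=1):
    """Return per-individual arrays: stratum, cluster, household, x."""
    rng = np.random.default_rng(seed)
    strata, clusters, households, xs = [], [], [], []
    cluster_id = household_id = 0
    for h in range(n_strata):
        # Number of clusters: discrete uniform on the (scaled-down) range.
        n_clusters = int(rng.integers(cluster_range[0], cluster_range[1] + 1))
        mu_h = rng.normal(130.0, SD_STRATUM)  # ASSUMED centre of 130
        for _ in range(n_clusters):
            # Households per cluster: normal with mean 50 and variance 1.
            n_hh = max(1, int(round(rng.normal(50.0, 1.0))))
            mu_hi = rng.normal(mu_h, SD_CLUSTER)
            # Household members: discrete uniform on 1..6.
            sizes = rng.integers(1, 7, size=n_hh)
            mu_hij = rng.normal(mu_hi, SD_HOUSEHOLD, size=n_hh)
            # Individual values scatter around their household mean.
            x = rng.normal(np.repeat(mu_hij, sizes), SD_INDIVIDUAL)
            strata.append(np.full(x.size, h))
            clusters.append(np.full(x.size, cluster_id))
            households.append(
                np.repeat(np.arange(household_id, household_id + n_hh), sizes))
            xs.append(x)
            cluster_id += 1
            household_id += n_hh
    return {
        "stratum": np.concatenate(strata),
        "cluster": np.concatenate(clusters),
        "household": np.concatenate(households),
        "x": np.concatenate(xs),
    }

pop = make_population()
high = pop["x"] >= 140  # dichotomized indicator (X >= 140)
```

Because the true mean of `pop["x"]` and the true proportion `high.mean()` are known for the generated data, they can serve as the benchmark against which sample estimates are compared.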

Sampling

A random sample of individuals was drawn from the population data, using the 2 sample designs mentioned above. Sampling was replicated 1000 times to obtain 1000 sample datasets for each sample design. One of the sample designs followed that of the CSLC (Method 1): 100 clusters were selected from each of the 10 strata by systematic random sampling without replacement, and all households in the 1000 selected clusters were included in a sample. Sample weights for Method 1 were computed as the inverse of the proportion of the number of selected individuals to the population in each stratum. The weights were thus constant across observations within each stratum. The other sample design was the 2-stage cluster sampling of households (Method 2), in which, after the data were sorted by identifiers of strata and clusters, clusters were selected throughout the 10 strata with probabilities proportional to the number of households without replacement in the first stage, and households were selected from each sample cluster by simple random sampling without replacement in the second stage. Five scenarios were established for Method 2 by using the total sample size of clusters in the first stage and a sampling fraction of households in the second stage: (1) 1000 clusters and 100%, (2) 2000 clusters and 50%, (3) 3000 clusters and 33%, (4) 4000 clusters and 25%, and (5) 5000 clusters and 20%. Sample weights for Method 2 were constructed as the inverse of the product of the probability of each cluster being selected and that of each household being sampled from each cluster. The weights were thus different across clusters, but were constant within clusters, for Method 2.
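The two stages of Method 2 and its weights can be sketched as follows. The text does not state which algorithm was used for the probability-proportional-to-size draw, so systematic selection along the cumulated cluster sizes is an assumption here, as are the function names `pps_systematic` and `two_stage_sample` and the toy cluster sizes.

```python
import numpy as np

def pps_systematic(sizes, n_sample, rng):
    """Select n_sample units without replacement, probability proportional to size."""
    sizes = np.asarray(sizes, dtype=float)
    total = sizes.sum()
    step = total / n_sample  # sampling interval
    if sizes.max() > step:
        raise ValueError("a unit exceeds the sampling interval")
    # One random start, then equally spaced points along the cumulated sizes.
    points = rng.uniform(0.0, step) + step * np.arange(n_sample)
    chosen = np.searchsorted(np.cumsum(sizes), points, side="right")
    chosen = np.minimum(chosen, sizes.size - 1)  # guard against float round-off
    # Inclusion probability of each chosen unit: n * size / total.
    return chosen, n_sample * sizes[chosen] / total

def two_stage_sample(cluster_sizes, n_clusters, fraction, seed=0):
    """Stage 1: PPS clusters. Stage 2: simple random sample of households."""
    rng = np.random.default_rng(seed)
    chosen, p_cluster = pps_systematic(cluster_sizes, n_clusters, rng)
    sample = []
    for c, p1 in zip(chosen, p_cluster):
        size = int(cluster_sizes[c])
        m = max(1, round(fraction * size))  # households taken in this cluster
        households = rng.choice(size, size=m, replace=False)
        # Weight = inverse of (cluster probability x household probability).
        weight = 1.0 / (p1 * m / size)
        sample.append((int(c), households, weight))
    return sample

# Toy frame: 2000 clusters of about 50 households (hypothetical sizes).
sizes = np.random.default_rng(42).integers(48, 53, size=2000)
sample = two_stage_sample(sizes, n_clusters=200, fraction=0.5)
estimated_households = sum(w * len(hh) for _, hh, w in sample)
```

In this sketch the weighted count of sampled households reproduces the number of households in the frame, which is a quick check that the weights are the inverse of the overall selection probability. The weight is the same for every household within a cluster and differs across clusters, as described for Method 2.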

Assessment

The mean of the continuous X and the proportion of its binary variable being equal to 1 (X ≥ 140) were estimated from each of the 1000 sample datasets to obtain a sampling distribution of 1000 estimates of each variable in the total population and by strata. The survey commands of Stata were used to account for the complex survey designs, including unequal probabilities of selection, in the estimation procedure.[9] All analyses were conducted with Stata/MP version 11.0 (StataCorp, College Station, TX, USA). To compare the statistical performance of the 2 sample designs, the root mean squared errors (RMSEs) of the estimated means and proportions were computed from the sampling distributions. The RMSE is the square root of the sum of the variance and the squared bias of an estimator. In other words, it provides a summary measure of the overall accuracy of an estimator by integrating the standard deviation of a sampling distribution (efficiency) and the deviation of an expected value from a true value in the population (bias).[10] In this study, the RMSE essentially equals the standard deviation of the sampling distribution (the square root of the variance) because estimated means and proportions are approximately unbiased under the simple weighted estimation for complex survey data.
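The decomposition of the RMSE into variance and squared bias can be checked numerically. The sketch below uses made-up replicate estimates (a normal draw centred on 129.8 with a spread of 0.19, chosen only for illustration) and a hypothetical helper `rmse_parts`. It is not the Stata procedure used in the study.

```python
import numpy as np

def rmse_parts(estimates, true_value):
    """Return (rmse, variance, bias) of replicate estimates around a known truth."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_value
    variance = est.var()  # ddof=0, so rmse**2 == variance + bias**2 exactly
    rmse = np.sqrt(np.mean((est - true_value) ** 2))
    return rmse, variance, bias

# Illustrative replicate estimates; NOT data from the study.
rng = np.random.default_rng(0)
estimates = rng.normal(loc=129.8, scale=0.19, size=1000)
rmse, variance, bias = rmse_parts(estimates, true_value=129.8)
```

When the bias is close to zero, `rmse` is close to `np.sqrt(variance)`, so differences in RMSE between designs reflect differences in the variability of the estimates.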

RESULTS

Table 1 shows the population size and basic statistics of X in the hypothetical population data. In total, the dataset had 34 262 865 individuals, 9 791 108 households, and 195 821 clusters. The population size by strata was comparable to the estimated Japanese population by prefecture in 2005: for instance, the smallest strata (ie, the fourth and ninth) were similar in size to Tottori and Shimane, whereas the 10th stratum was as large as Osaka prefecture excluding Osaka City.[3] In the whole population, the mean of X was 129.8 (standard deviation, 13.4), and the proportion of X that was equal to or greater than 140 was 22%.
Table 1. Population size and basic statistics of a continuous variable X in a hypothetical population by strata

| Stratum ID | Clusters | Households | Individuals | Mean of X | X ≥ 140 (%) |
|---|---|---|---|---|---|
| 1 | 22 708 | 1 135 425 | 3 969 109 | 131.0 | 24.8 |
| 2 | 6043 | 302 308 | 1 058 277 | 126.0 | 14.4 |
| 3 | 31 176 | 1 558 708 | 5 455 087 | 128.3 | 18.9 |
| 4 | 4161 | 208 094 | 726 722 | 127.4 | 16.9 |
| 5 | 18 121 | 905 860 | 3 172 896 | 131.8 | 26.6 |
| 6 | 18 841 | 942 105 | 3 296 412 | 133.5 | 31.1 |
| 7 | 21 151 | 1 057 710 | 3 701 249 | 130.0 | 22.4 |
| 8 | 32 143 | 1 607 112 | 5 623 977 | 126.2 | 15.0 |
| 9 | 4538 | 226 915 | 794 826 | 129.1 | 20.4 |
| 10 | 36 939 | 1 846 871 | 6 464 310 | 131.6 | 26.2 |

Table 2 shows the average size of the 1000 sample datasets by strata and sample design. Method 1 sampled approximately 17 500 members of 5000 households in 100 clusters from each stratum. When Method 2 was used to sample 1000 clusters in total, the number of selected clusters was much lower than 100 in the smallest strata, while it increased in large strata by up to 89%.
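Table 2. Average size of 1000 sample datasets by strata and sample design (Method 2 columns are by total number of sample clusters)

| Stratum ID | Method 1 | Method 2: 1000 | Method 2: 2000 | Method 2: 3000 | Method 2: 4000 | Method 2: 5000 |
|---|---|---|---|---|---|---|
| Clusters | | | | | | |
| 1 | 100 | 116 | 232 | 348 | 464 | 580 |
| 2 | 100 | 31 | 62 | 93 | 123 | 154 |
| 3 | 100 | 159 | 318 | 478 | 637 | 796 |
| 4 | 99 | 21 | 43 | 64 | 85 | 106 |
| 5 | 100 | 93 | 185 | 278 | 370 | 463 |
| 6 | 100 | 96 | 192 | 289 | 385 | 481 |
| 7 | 100 | 108 | 216 | 324 | 432 | 540 |
| 8 | 100 | 164 | 328 | 492 | 657 | 821 |
| 9 | 100 | 23 | 46 | 70 | 93 | 116 |
| 10 | 100 | 189 | 377 | 566 | 754 | 943 |
| Total | 999 | 1000 | 1999 | 3002 | 4000 | 5000 |
| Households | | | | | | |
| 1 | 4994 | 5799 | 5830 | 5860 | 5958 | 5799 |
| 2 | 4999 | 1546 | 1554 | 1560 | 1587 | 1544 |
| 3 | 4995 | 7960 | 8003 | 8044 | 8179 | 7961 |
| 4 | 4937 | 1063 | 1070 | 1074 | 1092 | 1063 |
| 5 | 5002 | 4627 | 4651 | 4674 | 4752 | 4626 |
| 6 | 4993 | 4812 | 4838 | 4862 | 4944 | 4812 |
| 7 | 5001 | 5404 | 5432 | 5460 | 5552 | 5402 |
| 8 | 5001 | 8209 | 8253 | 8293 | 8431 | 8208 |
| 9 | 5001 | 1159 | 1165 | 1171 | 1191 | 1159 |
| 10 | 4986 | 9433 | 9483 | 9531 | 9691 | 9427 |
| Total | 49 909 | 50 012 | 50 279 | 50 529 | 51 377 | 50 001 |
| Individuals | | | | | | |
| 1 | 17 434 | 20 272 | 20 377 | 20 490 | 20 825 | 20 269 |
| 2 | 17 409 | 5411 | 5439 | 5464 | 5559 | 5402 |
| 3 | 17 675 | 27 860 | 28 011 | 28 155 | 28 622 | 27 868 |
| 4 | 17 236 | 3712 | 3737 | 3750 | 3811 | 3714 |
| 5 | 17 417 | 16 207 | 16 288 | 16 372 | 16 639 | 16 200 |
| 6 | 17 496 | 16 833 | 16 928 | 17 018 | 17 300 | 16 837 |
| 7 | 17 556 | 18 906 | 19 002 | 19 109 | 19 429 | 18 901 |
| 8 | 17 672 | 28 727 | 28 883 | 29 019 | 29 508 | 28 732 |
| 9 | 17 383 | 4061 | 4081 | 4104 | 4171 | 4061 |
| 10 | 17 435 | 33 001 | 33 193 | 33 357 | 33 925 | 32 997 |
| Total | 174 713 | 174 990 | 175 939 | 176 838 | 179 789 | 174 981 |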

Method 1, stratified sampling of a constant number of clusters; Method 2, two-stage cluster sampling of households.

Table 3 presents the RMSE of 1000 estimated means of X by strata and sample design. Using Method 2, sampling of 1000 clusters reduced the RMSE by 12% in the total population by changing the sampling method of clusters from simple random sampling of a fixed number of clusters in each stratum to sampling with probability proportional to size across strata. This sampling method also lowered the RMSE by 20% in large strata, although the RMSE considerably increased in small strata, mainly because of the abovementioned decrease in their sample size.
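Table 3. Root mean squared error of 1000 estimates by strata and sample design (Method 2 columns are by total number of sample clusters)

| Stratum ID | Method 1 | Method 2: 1000 | Method 2: 2000 | Method 2: 3000 | Method 2: 4000 | Method 2: 5000 |
|---|---|---|---|---|---|---|
| Mean of continuous X | | | | | | |
| 1 | 0.522 | 0.504 | 0.320 | 0.282 | 0.237 | 0.232 |
| 2 | 0.498 | 0.931 | 0.708 | 0.593 | 0.542 | 0.541 |
| 3 | 0.540 | 0.418 | 0.297 | 0.272 | 0.224 | 0.186 |
| 4 | 0.653 | 1.067 | 0.756 | 0.684 | 0.546 | 0.557 |
| 5 | 0.502 | 0.569 | 0.438 | 0.375 | 0.330 | 0.282 |
| 6 | 0.486 | 0.534 | 0.400 | 0.322 | 0.301 | 0.276 |
| 7 | 0.511 | 0.459 | 0.342 | 0.285 | 0.250 | 0.214 |
| 8 | 0.526 | 0.406 | 0.327 | 0.258 | 0.221 | 0.204 |
| 9 | 0.554 | 1.050 | 0.830 | 0.679 | 0.556 | 0.563 |
| 10 | 0.475 | 0.374 | 0.264 | 0.246 | 0.202 | 0.172 |
| Total | 0.190 | 0.168 | 0.119 | 0.107 | 0.083 | 0.082 |
| Proportion of X ≥ 140 | | | | | | |
| 1 | 0.013 | 0.013 | 0.008 | 0.007 | 0.006 | 0.006 |
| 2 | 0.008 | 0.017 | 0.013 | 0.010 | 0.010 | 0.010 |
| 3 | 0.012 | 0.009 | 0.006 | 0.006 | 0.005 | 0.004 |
| 4 | 0.014 | 0.021 | 0.016 | 0.014 | 0.012 | 0.012 |
| 5 | 0.013 | 0.015 | 0.011 | 0.010 | 0.009 | 0.007 |
| 6 | 0.013 | 0.015 | 0.011 | 0.009 | 0.009 | 0.008 |
| 7 | 0.012 | 0.011 | 0.008 | 0.007 | 0.006 | 0.006 |
| 8 | 0.010 | 0.007 | 0.006 | 0.005 | 0.004 | 0.004 |
| 9 | 0.013 | 0.023 | 0.018 | 0.015 | 0.013 | 0.014 |
| 10 | 0.012 | 0.010 | 0.007 | 0.007 | 0.005 | 0.005 |
| Total | 0.004 | 0.004 | 0.003 | 0.003 | 0.002 | 0.002 |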

Method 1, stratified sampling of a constant number of clusters; Method 2, two-stage cluster sampling of households.

As the number of sample clusters increased in Method 2, the RMSE of the estimated means of X for the total population continued to decline and stabilized at around two fifths of that of Method 1 when a quarter of households were sampled from 4000 clusters (Table 3). The RMSE of Method 2 also decreased across strata and was nearly equal to or less than that of Method 1 in all strata when 4000 clusters were selected in total. Similar results were obtained for the RMSE of the proportion estimates of X ≥ 140 both in the total population and by strata (Table 3).
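The relative changes quoted in this section follow directly from the "Total" row for the mean of X in Table 3. The short check below only re-does that arithmetic; the dictionary keys are labels chosen here for illustration.

```python
# RMSE of the estimated mean of X in the total population (Table 3, "Total" row).
rmse_total = {
    "Method 1": 0.190,
    "Method 2, 1000 clusters": 0.168,
    "Method 2, 2000 clusters": 0.119,
    "Method 2, 3000 clusters": 0.107,
    "Method 2, 4000 clusters": 0.083,
    "Method 2, 5000 clusters": 0.082,
}

baseline = rmse_total["Method 1"]
# Ratio of each design's RMSE to that of Method 1.
ratio_to_method1 = {name: value / baseline for name, value in rmse_total.items()}
# Percentage reduction relative to Method 1.
reduction_pct = {name: 100 * (1 - r) for name, r in ratio_to_method1.items()}

for name in rmse_total:
    print(f"{name}: ratio {ratio_to_method1[name]:.2f}, "
          f"reduction {reduction_pct[name]:.0f}%")
```

The ratio for 1000 clusters corresponds to the reduction of about 12% reported for the total population, and the ratios for 4000 and 5000 clusters are both a little above 0.4, which matches the description of the RMSE stabilizing at around two fifths of that of Method 1.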

DISCUSSION

In designing national health surveys, it is essential to maximize the quality of health information, given the constraints on resources. This is particularly so for the CSLC because it is the largest health interview survey in Japan and serves as a sampling frame for some other national health surveys. The large-scale surveys of the CSLC currently employ an equal allocation of sample clusters to ensure equal errors of estimates across prefectures. The present simulation study confirmed that an alternative multistage probabilistic sampling might enhance the overall accuracy of estimates in a number of prefectures as well as in the whole population. A substantial part of this improvement was achieved by reducing variation in estimates by increasing the number of sample clusters and decreasing the sampling rate of households within clusters. A major concern in introducing this alternative sample design is that traveling to more clusters might add to the burden on public health centers in large prefectures. However, this may not necessarily occur, because the sampling fraction of interview households decreases with the number of clusters selected. Moreover, it is not clear whether large prefectures currently share an appropriate burden for their population size or can still accept additional survey clusters to maintain balance with other prefectures. Another concern regarding the implementation of the proposed survey design is that standard errors of estimates in small prefectures may become too large to be compared with those of other prefectures. However, our findings suggest that when the total number of clusters in a sample is adequate, the proposed sampling method also improves the variability of estimates in small prefectures. 
There is unlikely to be a large increase in the burden on small prefectures after switching to multistage proportional sampling, because the numbers of interview households and clusters do not exceed those of the conventional survey approach. Using the alternative survey design, a comparison of estimates at the subnational level may still be possible with reference to uncertainty intervals that appropriately reflect the population distribution and different sample sizes across prefectures. In addition, estimates for the total population that are derived without resorting to ratio estimates would theoretically have better comparability than those of small-scale surveys of the CSLC that employ a probabilistic sampling design. The introduction of this alternative method thus requires shifting the purpose of sampling designs from equal errors of estimates to the enhanced accuracy of parameter estimates across prefectures and in the whole population. The Japanese health information system needs substantial reform in the design of national household surveys. To obtain nationally representative samples, a multistage probabilistic sampling survey design is becoming the norm for household health surveys across the world.[11] It is also crucial to construct sample weights that account for any sampling errors and even to go as far as considering post-stratification weighting for nonresponse and noncoverage of subgroups.[12] It is worthwhile to investigate how these elements of probabilistic sampling might be incorporated into the current sample design of the CSLC, so that information on population health could be generated with increased accuracy and compatibility while carefully considering resource implications. The current study did have limitations that should be considered when interpreting the results. First, for ease of analysis, a continuous variable and its dichotomized variable were used for the assessment of sample designs. 
However, most of the variables collected by the large-scale CSLC were nominal or ordinal. It remains to be seen in future studies whether the findings from this study apply to multinomial and ordinal response scales. In addition, our estimates were based on simple weighted estimation techniques that took account of complex survey designs, although the large-scale CSLC employed ratio estimation using the number of household members as an auxiliary variable. Because ratio estimation is preferable only when variables of interest strongly correlate with the auxiliary variable,[13] our estimation strategy is nevertheless appropriate for studying sample designs in the context of general variables that might be introduced in future health surveys. Second, this study did not incorporate post-stratification weights to adjust for bias caused by nonresponse. This is also a major issue in the redesign of the CSLC that will be examined in future studies. These limitations, however, are outweighed by the fact that this study is the first empirical assessment of sample designs used in Japanese health surveys. The simulation approach introduced in this article has proven to be a useful tool for testing the performance of designs of complex surveys and clinical trials.[7] This analytic technique should be further applied in future research to investigate other important issues related to the sample design of the CSLC and other relevant surveys, such as how to ensure an adequate sample size for representing prefectures in smaller national surveys using the CSLC as a master sample.[1] In conclusion, the alternative sampling approach proposed in this study was superior to the present CSLC strategy in obtaining accurate survey estimates of population parameters both by prefecture and in the entire population. Globally, multistage household surveys are now the standard and a key platform for understanding population health. 
Academics and policymakers should carefully examine the costs and benefits of this alternative survey strategy as they pertain to redesigning the CSLC to improve the quality of national health information and promote better understanding of population health in Japan.
REFERENCES

Collins LM, Schafer JL, Kam CM. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods. 2001-12.

Tang L, Song J, Belin TR, Unützer J. A comparison of imputation methods in a longitudinal randomized clinical trial. Stat Med. 2005-07-30.

Bennett S, Radalowicz A, Vella V, Tomkins A. A computer simulation of household sampling schemes for health surveys in developing countries. Int J Epidemiol. 1994-12.

Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006-12-30.
