
A synthetic Longitudinal Study dataset for England and Wales.

Adam Dennett, Paul Norman, Nicola Shelton, Rachel Stuchbury.

Abstract

This article describes the new synthetic England and Wales Longitudinal Study 'spine' dataset designed for teaching and experimentation purposes. In the United Kingdom, there exist three Census-based longitudinal micro-datasets, known collectively as the Longitudinal Studies. The England and Wales Longitudinal Study (LS) is a 1% sample of the population of England and Wales (around 500,000 individuals), linking individual person records from the 1971 to 2011 Censuses. The synthetic data presented contains a similar number of individuals to the original data and accurate longitudinal transitions between 2001 and 2011 for key demographic variables, but unlike the original data, is open access.


Keywords:  Demography; Geography; Health Sciences; Longitudinal; Microdata; Social Science

Year:  2016        PMID: 27656667      PMCID: PMC5021767          DOI: 10.1016/j.dib.2016.08.036

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

- This data allows students and researchers who are unfamiliar with longitudinal census-based microdata to gain familiarity through hands-on experimentation.
- Data can be accessed without the usual restrictions that are placed on the England and Wales Longitudinal Study.
- This dataset should inspire new substantive pieces of longitudinal research: once more individuals are familiar with longitudinal demographic analysis and the research opportunities it presents, more ideas may flourish.

Data

The main data file spreadsheet accompanying this article contains 569,741 rows of data (representing 1 individual person per row) with the first 17 columns (in green) containing variables derived from responses to the 2011 Census. The 8 columns immediately following (in yellow) are synthetic longitudinal transition variables estimating the individual's state in the 2001 Census. The final two columns contain synthetic estimates of whether the individual would have given birth to children (and how many) or died over the 10-year period. Metadata for all variables are contained in the first two sheets. Supplementary materials also accompanying this article include transitional probability tables for each synthetic variable and R code to generate the synthetic variables from these transitions.

Experimental design, materials and methods

The method we employ is, at its core, a simple one-dimensional proportional fitting exercise, making it somewhat more straightforward than the multi-dimensional iterative proportional fitting first proposed by Deming and Stephan [1]. It has been necessary to avoid multi-dimensional variable interactions due to the small cell counts that would occur in the transition matrices.

Our base dataset is the 2011 Census Microdata Teaching File. Transitional probabilities for each variable (for example, not married to married, or good health to bad health; see Supplementary material) are derived from the LS for a series of 10-year age groups. All transitions are accurate when aggregated to these age groups, although not necessarily when aggregated by another variable such as geographic region.

In the Census Microdata Teaching File, age is recorded for 8 uneven age groups: 0–15, 16–24, 25–34, 35–44, 45–54, 55–64, 65–74, and 75 and over. These groups are re-estimated so that we have 11 even 10-year groups: 0–9, 10–19, 20–29, 30–39, 40–49, 50–59, 60–69, 70–79, 80–89, 90–99, and 100+.

To carry out the re-estimation to new groups, the single year of age for each person in each original age group is estimated before the person is allocated a new broad age group. To estimate the single year of age for each of the 569,741 individuals in the dataset, we use data on single year of age for each UK region from the 2011 Census aggregate tables. These Census tables can be aggregated into any age group required, and the proportion that each single year of age comprises within each group can then be calculated. Single year of age counts are disaggregated by region because the proportion of the population in each age group differs greatly between London and all other regions in England and Wales.
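The age re-estimation described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the single-year counts are made-up numbers standing in for one region of the 2011 Census aggregate tables, the function names are hypothetical, and drawing each person's single year of age at random in proportion to those counts is an assumption about how the estimate might be made.

```python
import random
from collections import Counter

# Hypothetical single-year-of-age counts for ONE region (made-up numbers
# standing in for the 2011 Census aggregate tables), ages 16 to 24.
single_year_counts = {age: 1000 + 10 * age for age in range(16, 25)}

def single_year_proportions(counts, lo, hi):
    """Share of each single year of age within the original group lo..hi."""
    total = sum(counts[a] for a in range(lo, hi + 1))
    return {a: counts[a] / total for a in range(lo, hi + 1)}

def ten_year_group(age):
    """Map a single year of age to the 10-year groups listed in the article."""
    if age >= 100:
        return "100+"
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def reestimate(n_people, counts, lo, hi, rng):
    """Draw a single year of age per person (assumption: random draw weighted
    by the regional proportions), then regroup into 10-year bands."""
    props = single_year_proportions(counts, lo, hi)
    ages = list(props)
    drawn = rng.choices(ages, weights=[props[a] for a in ages], k=n_people)
    return Counter(ten_year_group(a) for a in drawn)

rng = random.Random(2011)  # fixed seed so the sketch is repeatable
new_groups = reestimate(5000, single_year_counts, 16, 24, rng)
print(dict(new_groups))  # people from 16-24 split across 10-19 and 20-29
```

In a fuller version the same step would be repeated for every original age group and every region, each with its own set of proportions.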
The total number of individuals of single year of age $a$ in region $r$, $P_{a,r}$, will be a fraction $p_{a,r}$ of the total number of individuals $P_{A,r}$ in age group $A$ in region $r$:

$$P_{a,r} = p_{a,r} \, P_{A,r}$$

Such that:

$$p_{a,r} = \frac{P_{a,r}}{P_{A,r}}$$

and

$$\sum_{a \in A} p_{a,r} = 1$$

By calculating all proportions $p_{a,r}$ for each age group using the Census aggregate tables single year of age file, it is possible to decompose and re-estimate age group data as required.

The estimation of each longitudinal variable transition is carried out in almost exactly the same way for each variable (with some minor variations). Below, the general process is described using Approximated Social Grade as the exemplar.

Stage 1 – Transitional matrices of the same format are generated for each variable of interest from the ONS Longitudinal Study. These are broadly comparable to the example table below (Table 1), which shows the transitional counts for the Approximated Social Grade variable.
Table 1

Transitional counts between states of Approximated Social Grade, 2011–2001.

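| 2011 | 2001 | 10–19 | 20–29 | 30–39 | 40–49 | 50–59 | 60–69 | 70–79 | 80–89 | 90–99 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 200 | 4968 | 8608 | 7590 | 5805 | 3329 | 1092 | 0 |
| 1 | 2 | 0 | 1033 | 5456 | 4954 | 4200 | 3682 | 1775 | 417 | 0 |
| 1 | 3 | 0 | 183 | 808 | 903 | 695 | 505 | 309 | 87 | 0 |
| 1 | 4 | 0 | 3003 | 2250 | 1089 | 839 | 1068 | 854 | 232 | 0 |
| 2 | 1 | 11 | 236 | 2769 | 5450 | 5062 | 4128 | 2152 | 725 | 0 |
| 2 | 2 | 29 | 1865 | 8969 | 11,866 | 10,954 | 9233 | 5668 | 2090 | 0 |
| 2 | 3 | 10 | 422 | 1625 | 1850 | 1541 | 1449 | 843 | 218 | 0 |
| 2 | 4 | 60 | 4732 | 4159 | 3722 | 2828 | 2785 | 2227 | 821 | 0 |
| 3 | 1 | 0 | 83 | 669 | 1303 | 1260 | 999 | 441 | 114 | 0 |
| 3 | 2 | 0 | 556 | 2031 | 3295 | 3054 | 2884 | 1525 | 399 | 0 |
| 3 | 3 | 0 | 805 | 3107 | 5318 | 5225 | 4029 | 2478 | 605 | 0 |
| 3 | 4 | 0 | 2105 | 3464 | 4933 | 4255 | 3934 | 2294 | 591 | 0 |
| 4 | 1 | 0 | 95 | 538 | 1003 | 1076 | 1439 | 943 | 313 | 0 |
| 4 | 2 | 10 | 690 | 2162 | 3072 | 3115 | 3725 | 2309 | 616 | 0 |
| 4 | 3 | 0 | 411 | 1403 | 2281 | 2452 | 2844 | 2262 | 715 | 0 |
| 4 | 4 | 11 | 3379 | 6723 | 10,162 | 9864 | 10,433 | 9347 | 3531 | 11 |
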
As 2011 is our base population, transitional probabilities are calculated from the counts of transitions with each 2001 state calculated as a proportion of the corresponding 2011 state in turn. Table 2 exemplifies this more clearly:
Table 2

Transitional probabilities for Approximated Social Grade, 2011–2001.

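| 2011 | 2001 | 10–19 | 20–29 | 30–39 | 40–49 | 50–59 | 60–69 | 70–79 | 80–89 | 90–99 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0.045 | 0.368 | 0.553 | 0.570 | 0.525 | 0.531 | 0.597 | 0 |
| 1 | 2 | 0 | 0.234 | 0.405 | 0.319 | 0.315 | 0.333 | 0.283 | 0.228 | 0 |
| 1 | 3 | 0 | 0.041 | 0.060 | 0.058 | 0.052 | 0.046 | 0.049 | 0.048 | 0 |
| 1 | 4 | 0 | 0.680 | 0.167 | 0.070 | 0.063 | 0.097 | 0.136 | 0.127 | 0 |
| 2 | 1 | 0.1 | 0.033 | 0.158 | 0.238 | 0.248 | 0.235 | 0.198 | 0.188 | 0 |
| 2 | 2 | 0.264 | 0.257 | 0.512 | 0.518 | 0.537 | 0.525 | 0.520 | 0.542 | 0 |
| 2 | 3 | 0.091 | 0.058 | 0.093 | 0.081 | 0.076 | 0.082 | 0.077 | 0.057 | 0 |
| 2 | 4 | 0.545 | 0.652 | 0.237 | 0.163 | 0.139 | 0.158 | 0.204 | 0.213 | 0 |
| 3 | 1 | 0 | 0.023 | 0.072 | 0.088 | 0.091 | 0.084 | 0.065 | 0.067 | 0 |
| 3 | 2 | 0 | 0.157 | 0.219 | 0.222 | 0.221 | 0.243 | 0.226 | 0.233 | 0 |
| 3 | 3 | 0 | 0.227 | 0.335 | 0.358 | 0.379 | 0.340 | 0.368 | 0.354 | 0 |
| 3 | 4 | 0 | 0.593 | 0.374 | 0.332 | 0.308 | 0.332 | 0.340 | 0.346 | 0 |
| 4 | 1 | 0 | 0.021 | 0.050 | 0.061 | 0.065 | 0.078 | 0.063 | 0.060 | 0 |
| 4 | 2 | 0.476 | 0.151 | 0.200 | 0.186 | 0.189 | 0.202 | 0.155 | 0.119 | 0 |
| 4 | 3 | 0 | 0.090 | 0.130 | 0.138 | 0.149 | 0.154 | 0.152 | 0.138 | 0 |
| 4 | 4 | 0.524 | 0.739 | 0.621 | 0.615 | 0.598 | 0.566 | 0.629 | 0.682 | 1 |
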
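The step from counts to probabilities can be checked with a short Python sketch. The counts are the 20–29 column of Table 1; the function name and dictionary layout are illustrative choices rather than anything taken from the accompanying R scripts.

```python
# Counts for the 20-29 age group, copied from Table 1.
# Keys are (2011 social grade, 2001 social grade).
counts_20_29 = {
    (1, 1): 200, (1, 2): 1033, (1, 3): 183, (1, 4): 3003,
    (2, 1): 236, (2, 2): 1865, (2, 3): 422, (2, 4): 4732,
    (3, 1): 83,  (3, 2): 556,  (3, 3): 805, (3, 4): 2105,
    (4, 1): 95,  (4, 2): 690,  (4, 3): 411, (4, 4): 3379,
}

def transition_probabilities(counts):
    """Express each 2001 state as a proportion of its 2011 state's total."""
    totals = {}
    for (state_2011, _state_2001), n in counts.items():
        totals[state_2011] = totals.get(state_2011, 0) + n
    # A 2011 state with no observed people gets probability 0 throughout.
    return {
        key: (n / totals[key[0]] if totals[key[0]] else 0.0)
        for key, n in counts.items()
    }

probs = transition_probabilities(counts_20_29)
print(round(probs[(1, 1)], 3))  # 0.045, matching Table 2
print(round(probs[(4, 4)], 3))  # 0.739, matching Table 2
```

Because the denominator is the 2011 total, the probabilities read backwards in time: given a person's 2011 state, they give the chance of each 2001 state.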
Taking the first row of Table 1 (transitions between social grade 1 (AB) in 2011 and social grade 1 in 2001), we can observe that at age group 20–29 (2011 age group), 200 individuals in the LS underwent that transition. Table 2 shows that this is a proportion of 0.045 (4.5%) of all people of social grade 1 at age group 20–29 in 2011 (200/(200+1033+183+3003)=0.045). For each 2011 state of a variable, the 2001 state proportions at each age group will sum to 1. Similar transitional probability tables are generated for each of the variables we transition.

Stage 2 – We apply transitional probabilities to the Microdata Teaching File data to create estimates of the total number of people undergoing each transition.

Stage 3 – We use the estimates of the total number of people undergoing each transition to update (randomly) the Microdata Teaching File with expected transitions for the correct number of people.

Some small variations in the estimation process were required for variables such as religion and the estimation of births and deaths. For the full estimation process for each variable, see the accompanying processing scripts written in the R language and transitional probability files (which include 2011 to 2001 transitional probabilities for each variable and single year of age counts by region from the 2011 Census).
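Stages 2 and 3 can be sketched in Python as follows. This is a hedged illustration, not the authors' R implementation: the probabilities are the Table 2 values for social grade 1 at age 20–29, the 1,000 person IDs are hypothetical, and the largest-remainder rounding used to obtain whole-number counts is an assumption, since the article does not say how fractional counts are resolved.

```python
import random

# 2011 -> 2001 probabilities for social grade 1 at age 20-29 (from Table 2).
probs = {1: 0.045, 2: 0.234, 3: 0.041, 4: 0.680}

def expected_counts(n_people, probs):
    """Stage 2: convert probabilities into whole-number transition counts.

    Assumption: largest-remainder rounding, so counts sum to n_people.
    """
    total_p = sum(probs.values())
    raw = {s: n_people * p / total_p for s, p in probs.items()}
    counts = {s: int(v) for s, v in raw.items()}
    shortfall = n_people - sum(counts.values())
    # Give the leftover people to the states with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda s: raw[s] - counts[s], reverse=True)
    for s in by_remainder[:shortfall]:
        counts[s] += 1
    return counts

def assign_states(person_ids, counts, rng):
    """Stage 3: randomly choose which people receive each 2001 state."""
    labels = [s for s, n in counts.items() for _ in range(n)]
    rng.shuffle(labels)
    return dict(zip(person_ids, labels))

rng = random.Random(2001)  # fixed seed so the sketch is repeatable
people = [f"p{i}" for i in range(1000)]  # hypothetical IDs: grade 1, age 20-29
counts = expected_counts(len(people), probs)
state_2001 = assign_states(people, counts, rng)
print(counts)  # number of people assigned to each 2001 social grade
```

Fixing the counts first and then shuffling means the aggregate transitions for the age group match the probabilities as closely as whole numbers allow, while the choice of which individual receives which 2001 state is random.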
Specifications Table

| Field | Description |
|---|---|
| Subject area | Demography, Geography, Health Sciences |
| More specific subject area | Synthetic Longitudinal Microdata Estimation |
| Type of data | Tables, Excel files, R file |
| How data was acquired | Through a synthetic estimation process |
| Data format | Raw |
| Experimental factors | Data were estimated using transitional probabilities derived from the ONS Longitudinal Study for England and Wales and applied to 569,741 individuals contained within the ONS Census Microdata Teaching File |
| Experimental features | Data includes 10-year transitions (2011 back to 2001) for individuals for age, religion, general health, marital status, social grade, live births to mothers and deaths |
| Data source location | England and Wales |
| Data accessibility | Data is within this article |
