Synthetic ALSPAC longitudinal datasets for the Big Data VR project.

Demetris Avraam, Rebecca C Wilson, Paul Burton.

Abstract

Three synthetic datasets - of observation size 15,000, 155,000 and 1,555,000 participants, respectively - were created by simulating eleven cardiac and anthropometric variables from nine collection ages of the ALSPAC birth cohort study. The synthetic datasets retain similar data properties to the ALSPAC study data they are simulated from (covariance matrices, as well as the mean and variance values of the variables) without including the original data itself or disclosing participant information. In this instance, the three synthetic datasets have been used in an academia-industry collaboration to build a prototype virtual reality data analysis software, but they could have a broader use in method and software development projects where sensitive data cannot be freely shared.

Keywords:  ALSPAC; Simulated data; data visualisation; synthetic data; virtual reality; visual analytics

Year:  2017        PMID: 28989981      PMCID: PMC5605951          DOI: 10.12688/wellcomeopenres.12441.1

Source DB:  PubMed          Journal:  Wellcome Open Res        ISSN: 2398-502X


Introduction

In 2015 the Wellcome Trust and Epic Games ran a challenge, pairing computer games developers with researchers, to develop visualisation methods in virtual reality (VR) for big biomedical datasets from the following Wellcome Trust funded research projects:

1. A collection of historical medical records from the Casebooks Project
2. Genomics data from the Sanger Institute
3. Cohort data from the ALSPAC study (also known as Children of the Nineties)

University of Bristol researchers were paired with team Luma Pie (comprising Masters of Pie http://www.mastersofpie.com and Lumacode http://www.lumacode.com), who won the challenge with the vARC concept, designed to visualise complex cohort data from the ALSPAC study. The Masters of Pie blog records vARC development (part 1, http://www.mastersofpie.com/big-data-vr-challenge/ and part 2, http://www.mastersofpie.com/big-data-vr-challenge-phase-2-update/) and gives a description of the challenge-winning vARC prototype (http://www.mastersofpie.com/big-data-vr-challenge-winners/).

The value of this emerging technology, and its potential applications to e-health and wider use in medicine, was recognised by the winning collaboration, who continue the development of a VR proof-of-concept biomedical data exploration and visualisation tool under the Big Data VR project, using the ALSPAC cohort study as a use case. This project has additionally explored a variety of VR visual analytic methodologies, investigated VR analytics applied to different scales of data and scoped the integration of privacy-protecting analytical methods via DataSHIELD [1]. Findings will be reported in a forthcoming paper.

Due to the nature of the Big Data VR project, it was necessary to use a dataset that could be freely shared across the project team of researchers and games developers, and that could also be deployed as an open test dataset for a demo release of the Big Data VR tool. There are, however, ethical and legal constraints on the open sharing of, or access to, biomedical study data due to concerns around participant privacy and disclosure risk. ALSPAC deploys a rigorous data governance and access policy to protect participant confidentiality and guard against disclosure. This meant that we could not simply share real ALSPAC data with the developers without going through a potentially lengthy process of formally assessing the bona fides of every single person in the development team who might need to work with or see the data. Given the very short time scale of the project this was not feasible. However, in order to properly challenge the developers and their evolving tools, and to ensure that the tools would ultimately be useful in a meaningful scientific context, it was nevertheless important that the test datasets closely mirrored real ALSPAC data. To ensure privacy protection, it was therefore necessary to generate synthetic datasets to be used in the project, an approach commonly used within the health research data domain [2]. This paper outlines three synthetic datasets simulated from ALSPAC study data for the purposes of the Big Data VR project.

Methods

Based at the University of Bristol, ALSPAC (also known as Children of the 90s) studies the health and well-being of children from pregnancies in the Avon region, born between 1991 and 1992. The whole cohort includes children from original enrolment (phase I recruitment), as well as children invited to join from the age of 7 onwards (from phase II and III recruitment), giving n = 15445 participants (excluding triplets and quadruplets) at the time of this work. Cohort profiles are described in Boyd et al. [3] and Fraser et al. [4], and the study website contains details of all the data that are available through a fully searchable data dictionary (http://www.bris.ac.uk/alspac/researchers/data-access/data-dictionary/).

The variables from the 15445 ALSPAC child participants used for the simulated data generation are outlined in Table 1. They include cardiac measures (i.e. blood pressure and pulse rate) and anthropometric measures (i.e. height, sitting height, weight, BMI, hip and waist circumference) of children visiting different ALSPAC clinics. The age indicated at each clinic is the age of the child at attendance, which is calculated from the date of the visit and the child's date of birth. All variables used were continuous, except gender, which is a binary variable (with 1 indicating male and 2 indicating female). The coverage of these variables at different clinic ages is shown in Table 2, highlighting any variables missing from collection.

Table 1. A description of the ALSPAC variables used to generate the simulated datasets.

ALSPAC Variable Name | Description | Simulated Variable Name
kz021 | Sex | sex
f7ms010 | Height (cm): F@7 | height.7
f7ms012 | Sitting height (cm): F@7 | height.sit.7
f7ms018 | Waist circumference (cm): F@7 | waist.7
f7ms020 | Hip circumference (cm): F@7 | hip.7
f7ms026 | Weight (kg): F@7 | weight.7
f7ms026a | BMI: F@7 | BMI
f7sa021 | Mean BP systolic: samples F@7 | sbp.7
f7sa022 | Mean BP diastolic: samples F@7 | dbp.7
f7sa023 | Mean Pulse: samples F@7 | pulse.7
f7003c | Age (months) at Focus @ 7 visit | age.7
f8lf020 | Child height (cm): LF, F@8 | height.8
f8lf021 | Child weight (kg): LF, F@8 | weight.8
f8003c | Age (months) at Focus @ 8 visit | age.8
f9ms010 | Height (cm): F@9 | height.9
f9ms012 | Sitting height (cm): F@9 | height.sit.9
f9ms018 | Waist circumference (cm): F@9 | waist.9
f9ms020 | Hip circumference (cm): F@9 | hip.9
f9ms026 | Weight (kg): F@9 | weight.9
f9ms026a | BMI: F@9 | BMI.9
f9sa021 | Mean BP systolic: samples F@9 | sbp.9
f9sa022 | Mean BP diastolic: samples F@9 | dbp.9
f9sa023 | Mean Pulse: samples F@9 | pulse.9
f9003c | Age (months) at Focus @ 9 visit | age.9
fdms010 | Height (cm): F10+ | height.10
fdms012 | Sitting height (cm): F10+ | height.sit.10
fdms018 | Waist circumference (cm): F10+ | waist.10
fdms026 | Weight (kg): F10+ | weight.10
fdms026a | BMI: F10+ | BMI.10
SBP | Systolic blood pressure_AS | sbp.10
DBP | Diastolic blood pressure_AS | dbp.10
fd003c | Age (months) at F10+ visit | age.10
fems010 | Height (cm): F11+ | height.11
fems012 | Sitting height (cm): F11+ | height.sit.11
fems018 | Waist circumference (cm): F11+ | waist.11
fems020 | Hip circumference (cm): F11+ | hip.11
fems026 | Weight (kg): F11+ | weight.11
fems026a | BMI: F11+ | BMI.11
fesa021 | Mean BP systolic: samples F11+ | sbp.11
fesa022 | Mean BP diastolic: samples F11+ | dbp.11
fesa023 | Mean Pulse: samples F11+ | pulse.11
fe003c | Age (months) at F11+ visit | age.11
ff2000 | M5: Height (cms) | height.12
ff2005 | M7: Sitting height (cms) | height.sit.12
ff2020 | M11: Waist circumference (cms) | waist.12
ff2620 | B8: BP result 1 - systolic | sbp.12
ff2621 | B9: BP result 1 - diastolic | dbp.12
ff2622 | B10: BP result 1 - pulse | pulse.12
ff0011a | DV: Age of study child at attendance (months) | age.12
fg3100 | M5: Height (cms) : TF2 | height.13
fg3120 | M11: Waist circumference (cms) : TF2 | waist.13
fg3130 | M15: Weight (Kgs) : TF2 | weight.13
fg6120 | B15: BP result 1 - systolic : TF2 | sbp.13
fg6121 | B16: BP result 1 - diastolic : TF2 | dbp.13
fg6122 | B17: BP result 1 - pulse : TF2 | pulse.13
fg0011a | DV: Age of study child at attendance (months) TF2 | age.13
fh3000 | M5: Height (cms) : TF3 | height.15
fh3010 | M15: Weight (Kgs) : TF3 | weight.15
fh4020 | M11: Waist circumference (cms) : TF3 | waist.15
fh4030 | V6: Sitting height (cms) : TF3 | height.sit.15
fh2030 | AC18: BP result 1 - systolic : TF3 | sbp.15
fh2031 | AC19: BP result 1 - diastolic : TF3 | dbp.15
fh2032 | AC20: BP result 1 - pulse : TF3 | pulse.15
fh0011a | DV: Age of study child at attendance (months) TF3 | age.15
FJMR020 | M5: Height (cms) [F17] | height.17
FJMR022 | M15: Weight (kgs) [F17] | weight.17
FJAR020a | dv: Right arm BP mean: systolic | sbp.17
FJAR020b | dv: Right arm BP mean: diastolic | dbp.17
FJAR020c | dv: Right arm BP mean: pulse | pulse.17
FJMR022a | dv: BMI [F17] | bmi.17
FJ003a | Age in months at clinic visit [F17] | age.17
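
Table 2. A summary of data capture in clinics for the respective ALSPAC variables.

Variable (units) | F@7 | F@8 | F@9 | F@10 | F@11 | TF1 | TF2 | TF3 | TF4
Gender (1 male, 2 female) | yes | yes | yes | yes | yes | yes | yes | yes | yes
Exact Age (months) | yes | yes | yes | yes | yes | yes | yes | yes | yes
Height (cm) | yes | yes | yes | yes | yes | yes | yes | yes | yes
Sitting Height (cm) | yes | NA | yes | yes | yes | yes | NA | yes | NA
Waist Circumference (cm) | yes | NA | yes | yes | yes | yes | yes | yes | NA
Hip Circumference (cm) | yes | NA | yes | NA | yes | NA | NA | NA | NA
Weight (kg) | yes | yes | yes | yes | yes | yes | yes | yes | yes
Systolic Blood Pressure (mmHg) | yes | NA | yes | yes | yes | yes | yes | yes | yes
Diastolic Blood Pressure (mmHg) | yes | NA | yes | yes | yes | yes | yes | yes | yes
Pulse (beats per minute) | yes | NA | yes | NA | yes | yes | yes | yes | yes
BMI (kg/m2) | yes | NA | yes | yes | yes | NA | NA | NA | yes
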
Synthetic data were simulated using the statistical programming language R (version 3.2.3) [5]. The simulation comprised the following steps, with the corresponding R functions noted in line:

Data cleaning

The ALSPAC dataset described in Table 1 and Table 2 was cleaned by removing all rows with any missing values, leaving 1593 complete observations.
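
As an illustration of this cleaning step, the following is a minimal R sketch that assumes listwise deletion with complete.cases(); the paper does not name the function used. The data frame `raw` and its values are invented for the example and are not ALSPAC data.

```r
# Toy data frame standing in for the real ALSPAC extract (values are invented).
raw <- data.frame(
  sex       = c(1, 2, 2, 1, NA),
  height_cm = c(125.1, NA, 128.4, 122.9, 126.0),
  weight_kg = c(25.3, 24.8, 27.1, NA, 26.2)
)

# Keep only the rows in which every variable is observed.
clean <- raw[complete.cases(raw), ]

# Number of complete observations that remain.
n_complete <- nrow(clean)
```

Any row with at least one NA is dropped, which is why the number of complete observations can be far smaller than the number of participants.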

Standardising continuous variables

Each continuous variable, x, was standardised using the z-score transformation:

$$ z = \frac{x - \mu}{\sigma} $$

where z denotes the standardised version of the variable, with µ and σ representing the mean and standard deviation of x, respectively (using mean() and sd()). This z-score transformation was used to transform normally distributed data N(µ, σ) to standard normally distributed data N(0, 1).
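
The standardisation can be sketched in R with mean() and sd(), the two functions named above. The vector `x` holds invented values and is only an assumption for the example.

```r
# Invented example values for one continuous variable (e.g. heights in cm).
x <- c(120.5, 125.0, 130.2, 127.3, 122.8)

mu    <- mean(x)  # sample mean
sigma <- sd(x)    # sample standard deviation

# z-score transformation: subtract the mean, divide by the standard deviation.
z <- (x - mu) / sigma
```

After the transformation, `z` has mean 0 and standard deviation 1 (up to floating-point error).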

Data generation: Continuous variables

It was assumed that the continuous variables (excluding BMI) follow an approximate multivariate normal distribution. Using the pseudo-random multivariate normal generator (mvrnorm()), three synthetic datasets were generated of observation sizes 15500, 155000 and 1550000 participants. Because approximate multivariate normality was assumed (without transforming any non-normal data to normal), the synthetic data do not have precisely the same joint and marginal distributions as the original ALSPAC data, but they approximate them closely, with most variables passing formal tests of normality. The simulated continuous variables were then rescaled back to their original mean and standard deviation by the inverse z-score transformation:

$$ X = Z\sigma + \mu $$

where X and Z denote the simulated data for x and z respectively, with µ and σ representing (as above) the mean and standard deviation of the real x data.
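
A minimal sketch of this generation step follows, using mvrnorm() from the MASS package. It is not the authors' script: the two-variable data frame `real`, the seed and all values are invented, and it assumes that the covariance matrix passed to mvrnorm() is the one estimated from the standardised real data.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)  # arbitrary seed, for reproducibility of this sketch only

# Invented stand-in for the cleaned real data: two correlated variables.
n_real <- 500
height <- rnorm(n_real, mean = 125, sd = 5)
weight <- 25 + 0.4 * (height - 125) + rnorm(n_real, sd = 3)
real   <- data.frame(height = height, weight = weight)

mu    <- colMeans(real)      # means of the real variables
sigma <- apply(real, 2, sd)  # standard deviations of the real variables

# Standardise, then estimate the covariance matrix of the z-scores
# (this equals the correlation matrix of the original variables).
z_real  <- scale(real, center = mu, scale = sigma)
Sigma_z <- cov(z_real)

# Draw standardised synthetic observations from N(0, Sigma_z).
n_sim <- 15500
Z <- mvrnorm(n_sim, mu = rep(0, ncol(real)), Sigma = Sigma_z)

# Inverse z-score transformation, X = Z * sigma + mu, column by column.
X <- sweep(sweep(Z, 2, sigma, `*`), 2, mu, `+`)
colnames(X) <- names(real)
```

Simulating on the standardised scale and rescaling afterwards keeps the correlation structure of the real data while restoring each variable's original mean and standard deviation.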

Data generation: Binary variables

The simulated gender variable retains the same proportions of males and females as in the original ALSPAC dataset. This was achieved by converting the levels 1-2 (indicating males and females respectively) to 0-1 data and then fitting a logistic model for gender regressed (glm()) on all continuous variables using the original dataset. The estimated coefficients were then used to calculate the linear predictors of the simulated datasets. Then, using the log odds, y, from the linear predictors, we calculated the probability, p, that a subject belongs to the level coded 1 (female), using the inverse logit (also known as expit) transformation:

$$ p = \frac{e^{y}}{1 + e^{y}} $$

The simulated binary variable denoting gender in each subject was then generated using the value of p in that individual (derived from the expit transformation) as the probability argument in R's rbinom() function.
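
The binary-variable step can be sketched as follows. The data frames `real` and `sim`, the two predictors and all values are invented for illustration; the real model used all continuous variables. The sketch assumes the 0-1 recoding maps male to 0 and female to 1.

```r
set.seed(2)  # arbitrary seed for this sketch

# Invented stand-in for the cleaned real data; sex is coded 1 = male, 2 = female.
n_real <- 400
real <- data.frame(
  sex    = rep(c(1, 2), each = n_real / 2),
  height = c(rnorm(n_real / 2, 126, 5), rnorm(n_real / 2, 124, 5)),
  weight = c(rnorm(n_real / 2, 26, 3), rnorm(n_real / 2, 25, 3))
)
real$sex01 <- real$sex - 1  # recode to 0 = male, 1 = female

# Logistic regression of sex on the continuous variables, fitted to the real data.
fit <- glm(sex01 ~ height + weight, family = binomial, data = real)

# Invented stand-in for the simulated continuous variables.
n_sim <- 1000
sim <- data.frame(
  height = rnorm(n_sim, 125, 5),
  weight = rnorm(n_sim, 25.5, 3)
)

# Linear predictor (log odds) for each simulated subject, then expit.
y <- predict(fit, newdata = sim, type = "link")
p <- plogis(y)  # identical to exp(y) / (1 + exp(y))

# One Bernoulli draw per subject, mapped back to the 1/2 coding.
sim$sex <- rbinom(n_sim, size = 1, prob = p) + 1
```

Because each subject's gender is drawn with a probability that depends on that subject's simulated measurements, the association between gender and the continuous variables is carried over to the synthetic data.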

Data generation: BMI variable

The simulated BMI variable was calculated from the simulated values of weight and height for the clinics F@7, F@9, F@10, F@11 and TF4 using the relationship

$$ \mathrm{BMI} = \frac{\text{weight (kg)}}{[\text{height (m)}]^{2}} $$
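
As a small worked example in R, assuming height is stored in centimetres (as in Table 1) and therefore has to be converted to metres first. The column names and values below are invented.

```r
# Invented simulated values for one clinic.
sim <- data.frame(
  height_cm = c(125.0, 130.0, 120.0),
  weight_kg = c(25.0, 30.0, 22.0)
)

# BMI = weight (kg) / height (m)^2, with height converted from cm to m.
sim$bmi <- sim$weight_kg / (sim$height_cm / 100)^2
```

For the first row, 25 kg at 1.25 m gives 25 / 1.5625 = 16 kg/m2.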

Data generation: Age variable

The age at each clinic, initially reported in months, was divided by 12 to represent its values in years. The simulated age variable at each clinic was generated assuming normality and using the rnorm() R function with mean and variance set equal to the actual mean and variance of age at each clinic.
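
A minimal sketch of the age simulation for a single clinic, with invented ages. Note that rnorm() takes a standard deviation rather than a variance, so the sketch passes sd(), which is the square root of the variance.

```r
set.seed(3)  # arbitrary seed for this sketch

# Invented ages at attendance for one clinic, in months.
age_months <- c(89, 90, 91, 92, 90, 88, 93)

# Convert to years.
age_years <- age_months / 12

# Simulate ages from a normal distribution with the same mean and variance.
n_sim   <- 1000
age_sim <- rnorm(n_sim, mean = mean(age_years), sd = sd(age_years))
```

In this sketch age is drawn independently of the other variables, which is an assumption based on the description above.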

Dataset validation

Three synthetic datasets were simulated using the methodology described above - with observation size 15,500 participants (simulated.data.1.csv), 155,500 participants (simulated.data.2.csv) and 1,555,000 participants (simulated.data.3.csv). Table 1 shows the data dictionary for these. The three synthetic datasets have similar properties to the ALSPAC data they are simulated from. This is demonstrated by the close similarity of the estimated means, variances and covariance matrices for the relevant variables in the original ALSPAC dataset and the three synthetic datasets (see Supplementary material). The synthetic datasets contain none of the original data itself.

The ALSPAC dataset (project number B2506) that these synthetic data are simulated from can be obtained from ALSPAC through the standard ALSPAC research proposal and data access policy: http://www.bristol.ac.uk/alspac/researchers/access/. The script used to generate the three synthetic datasets is available at https://doi.org/10.5281/zenodo.817502 [6]. The synthetic data described in this paper are available at the University of Bristol data repository, data.bris, at https://doi.org/10.5523/bris.3116aupg8mfgi23pnslu8tulev [7].
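
The comparison of means, variances and covariance matrices described in this section can be sketched in R as below. Both data frames are invented stand-ins generated inside the example: `real` is not ALSPAC data, and `sim` is not one of the published files.

```r
library(MASS)  # provides mvrnorm()

set.seed(4)  # arbitrary seed for this sketch

# Invented "real" data with two correlated variables.
real <- as.data.frame(
  mvrnorm(300, mu = c(125, 25), Sigma = matrix(c(25, 6, 6, 9), nrow = 2))
)
names(real) <- c("height", "weight")

# Synthetic data simulated from the estimated means and covariance matrix.
sim <- as.data.frame(mvrnorm(15500, mu = colMeans(real), Sigma = cov(real)))
names(sim) <- names(real)

# Side-by-side means and variances for each variable.
comparison <- data.frame(
  mean_real = colMeans(real),
  mean_sim  = colMeans(sim),
  var_real  = sapply(real, var),
  var_sim   = sapply(sim, var)
)

# Largest absolute difference between the two covariance matrices.
max_cov_diff <- max(abs(cov(real) - cov(sim)))
```

Small differences in `comparison` and a small `max_cov_diff` indicate that the synthetic data reproduce the first and second moments of the source data.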

Ethical statement

Ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. A comprehensive list of research ethics committee approval references is available to download at: http://www.bristol.ac.uk/alspac/researchers/research-ethics/.

The authors have created a dummy dataset that has similar properties to the real dataset but none of the information governance issues in terms of sharing the data. This will be a great example dataset for teaching purposes as it has no IG restrictions and contains no outliers or missing values. It could also be used for research into data handling, storing or visualisation as suggested in the main body of the article. The data was easily accessible in .CSV files and the files are accompanied by a data dictionary which describes the data well.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

This report describes the creation of three simulated datasets using the ALSPAC (Children of the Nineties) cohort. The data are easily accessible, and the provision of the links to the 'Masters of Pie' blog helps provide some interesting context for the work. Some very minor points for the authors' consideration:

1. It is mentioned that the VR project had a very short time scale - it might be useful to provide some information about the time scale, and how this relates to a typical turnaround time for a formal data request using the usual ALSPAC process.
2. Could the authors provide a little more information to explain why they created three synthetic datasets of different sizes? Why 15500, 155000 and 1550000?
3. Table 2 suggests that weight and height were available at all timepoints. However, Table 2 lists 'NA' for BMI at various timepoints - if weight and height were available, why was BMI not derived?
4. Finally, in Supplementary File 2 (variable variance), the variances for the real ALSPAC data are really similar to the variances of the simulated data for most of the variables. The variances for BMI seem to be a little different, with the variances in the simulated datasets being a little higher at all ages across the simulated datasets. Perhaps the authors could comment on this observation?

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

This article describes the process of creating a number of synthetic datasets to underpin a virtual reality visual analysis project. The process is described fully, with working links to appropriate information about the synthetic datasets (and the script used to generate these) and to information about the original ALSPAC data. The importance of the article reflects the fact that it documents an approach to generating synthetic data for use in cases where: a) it is not possible, for ethical and legal reasons, to share the survey or administrative data on which the synthetic data is based; but b) where the synthetic data created needs to mimic the data from which they were derived. A few minor editorial comments for the authors to consider:

1. The article describes the creation of three synthetic datasets of different sizes, but does not explain the rationale behind the need for these three datasets. More information on this would be useful.
2. It would be helpful to introduce the 'Big Data VR' project in the first para of the introduction - it is currently not immediately clear that this is the same thing as the Wellcome & Epic Games 'challenge' (which is the terminology initially used in these introductory sentences).
3. The Methods/data cleaning description states that the original ALSPAC dataset was 'cleaned by removing all rows with missing values'. I presume this should read 'by removing all rows with ANY missing values'?

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
1.  Cohort Profile: the Avon Longitudinal Study of Parents and Children: ALSPAC mothers cohort.

Authors:  Abigail Fraser; Corrie Macdonald-Wallis; Kate Tilling; Andy Boyd; Jean Golding; George Davey Smith; John Henderson; John Macleod; Lynn Molloy; Andy Ness; Susan Ring; Scott M Nelson; Debbie A Lawlor
Journal:  Int J Epidemiol       Date:  2012-04-16

2.  Cohort Profile: the 'children of the 90s'--the index offspring of the Avon Longitudinal Study of Parents and Children.

Authors:  Andy Boyd; Jean Golding; John Macleod; Debbie A Lawlor; Abigail Fraser; John Henderson; Lynn Molloy; Andy Ness; Susan Ring; George Davey Smith
Journal:  Int J Epidemiol       Date:  2012-04-16
