Rushani Wijesuriya, Margarita Moreno-Betancur, John Carlin, Anurika Priyanjali De Silva, Katherine Jane Lee.
Abstract
Three-level data arising from repeated measures on individuals clustered within higher-level units are common in medical research. A complexity arises when individuals change clusters over time, resulting in a cross-classified data structure. Missing values in these studies are commonly handled via multiple imputation (MI). If the three-level, cross-classified structure is modeled in the analysis, it also needs to be accommodated in the imputation model to ensure valid results. While incomplete three-level data can be handled using various approaches within MI, the performance of these in the cross-classified data setting remains unclear. We conducted simulations under a range of scenarios to compare these approaches in the context of an acute-effects cross-classified random effects substantive model, which models the time-varying cluster membership via simple additive random effects. The simulation study was based on a case study in a longitudinal cohort of students clustered within schools. We evaluated methods that ignore the time-varying cluster memberships by taking the first or most common cluster for each individual; pragmatic extensions of single- and two-level MI approaches within the joint modeling (JM) and the fully conditional specification (FCS) frameworks, using dummy indicators (DI) and/or imputing repeated measures in wide format to account for the cross-classified structure; and a three-level FCS MI approach developed specifically for cross-classified data. Results indicated that the FCS implementations performed well in terms of bias and precision, while the JM approaches performed poorly. Under both frameworks, approaches using the DI extension should be used with caution in the presence of sparse data.
Keywords: clustered data; cross-classified data; missing data; multiple imputation; three-level data; time-varying cluster memberships
Year: 2022 PMID: 35893317 PMCID: PMC9540355 DOI: 10.1002/sim.9515
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.497
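The abstract describes the substantive model only in words. As a rough sketch, an acute-effects cross-classified random effects model with simple additive random effects could be written as below. The notation is an assumption made here for illustration (it is not taken from the paper): $y_{ij}$ is the outcome for individual $i$ at wave $j$, $x_{i,j-1}$ is the exposure at the previous wave, $\mathbf{z}_i$ collects baseline covariates, and $s(i,j)$ denotes the cluster (school) that individual $i$ belongs to at wave $j$.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1\, x_{i,j-1} + \boldsymbol{\gamma}^{\top}\mathbf{z}_i
          + u_{s(i,j)} + v_i + \varepsilon_{ij}, \\
u_{s} &\sim N(0, \sigma^2_{3}), \qquad
v_i \sim N(0, \sigma^2_{2}), \qquad
\varepsilon_{ij} \sim N(0, \sigma^2_{1}).
\end{aligned}
```

Under this sketch, $\beta_1$ is the regression coefficient of interest, and $\sigma^2_3$, $\sigma^2_2$, and $\sigma^2_1$ are the level 3 (cluster), level 2 (individual), and level 1 (repeated measure) variance components. Because $s(i,j)$ may change with $j$, individuals are not strictly nested within clusters, which is what makes the structure cross-classified.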
Description of the variables of interest, for each individual at each wave, in the motivating case study in the CATS
| Variable | Type | Grouping/range | Label |
|---|---|---|---|
| Child's sex | Categorical | 0 = Female, 1 = Male | |
| Child's age (wave 1) (years) | Continuous | Range [7–11] | |
| Standardized SES measured by the SEIFA IRSAD (wave 1) | Continuous | z‐score | |
| Teacher's numeracy rating (wave 1) | Continuous | Range [1–5] | |
| Teacher rating score (outcome) | Continuous | Range [1–5] | |
| Depressive symptom score (exposure) | Continuous | Range [0–8] | |
| Child behavior problems reported by the SDQ (auxiliary variable) | Continuous | Range [0–40] | |
Abbreviations: IRSAD, index of relative socio‐economic advantage and disadvantage; SDQ, strengths and difficulties questionnaire; SEIFA, socioeconomic index for areas; SES, socio‐economic status.
The rating provided by the classroom teacher assessing the student's mathematical skills, measured on a 5‐point Likert scale.
A subset of 4 items (each ranging from 0 to 2) from the Short Mood and Feelings Questionnaire (SMFQ) was used to measure depressive symptoms at each wave in the CATS. The exposure measure (at each wave) was the total summary score of these four items.
Derived from the first 4 subscales of the SDQ: emotional symptoms, conduct problems, hyperactivity/inattention, peer relationship problems (each ranging from 0 to 10). This variable was not included in the analysis but was included in the imputation model as an auxiliary variable to improve its performance.
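The ranges in the table follow directly from the scoring described in these notes: four items scored 0 to 2 give a total in [0, 8], and four subscales scored 0 to 10 give a total in [0, 40]. A minimal Python sketch of that arithmetic is shown below; the function name and the example responses are hypothetical and not part of the CATS data.

```python
def total_score(items, n_items, item_max):
    """Sum item (or subscale) scores after checking count and range.

    Hypothetical helper for illustration; not the CATS scoring code.
    """
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not 0 <= value <= item_max for value in items):
        raise ValueError(f"each item must lie in [0, {item_max}]")
    return sum(items)


# Made-up responses to the 4 SMFQ items (each 0 to 2): total lies in [0, 8].
depressive_score = total_score([0, 2, 1, 2], n_items=4, item_max=2)

# Made-up scores on the 4 SDQ subscales (each 0 to 10): total lies in [0, 40].
sdq_total = total_score([3, 0, 10, 4], n_items=4, item_max=10)
```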
Missing data frequencies and proportions in the outcome (teacher rating score), the exposure (depressive symptom score), and the auxiliary variable (child behavior problems reported by the SDQ), by wave, in the substantive analysis (n = 1168)
| Data collection wave | Teacher rating score frequency (%) | Depressive symptom score frequency (%) | Child behavior problems reported by the SDQ frequency (%) |
|---|---|---|---|
| 1 | 87 (7%) | 74 (6%) | 31 (3%) |
| 2 | 118 (10%) | 82 (7%) | 317 (27%) |
| 3 | 103 (9%) | 80 (7%) | 290 (25%) |
| 4 | 130 (11%) | 119 (10%) | 377 (32%) |
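The percentages in this table are the missing-value counts divided by n = 1168 and rounded to the nearest whole percent. The short Python sketch below reproduces them from the tabulated frequencies; the dictionary keys are labels chosen here for illustration.

```python
N = 1168  # sample size in the substantive analysis

# Missing-value frequencies by wave, copied from the table above.
missing = {
    "teacher_rating": {1: 87, 2: 118, 3: 103, 4: 130},
    "depressive_symptoms": {1: 74, 2: 82, 3: 80, 4: 119},
    "sdq_behavior_problems": {1: 31, 2: 317, 3: 290, 4: 377},
}


def percent_missing(count, n=N):
    """Percentage of the n participants with a missing value, to the nearest whole percent."""
    return round(100 * count / n)


percentages = {
    variable: {wave: percent_missing(count) for wave, count in by_wave.items()}
    for variable, by_wave in missing.items()
}
```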
Summary of the evaluated MI approaches for handling incomplete three‐level cross‐classified data
| MI approach | Type | Software | How clustering due to higher‐level clusters is handled | How clustering due to repeated measures is handled | How the time‐varying cluster memberships are handled |
|---|---|---|---|---|---|
| | Standard (single‐level) | SAS, SPSS, Stata, Mplus, R | DI | Repeated measures imputed in wide format | Time‐varying nature of cluster memberships is ignored and only the DI for cluster membership at the first time point is used for all time points |
| | Standard (single‐level) | SAS, SPSS, Stata, Mplus, R | DI | Repeated measures imputed in wide format | Time‐varying nature of cluster memberships is ignored and only the DI for the most common cluster membership over the course of the study is used for all time points |
| | Standard (single‐level) | SAS, SPSS, Stata, Mplus, R, Blimp | DI | Repeated measures imputed in wide format | Restricting the univariate imputation models specified for each incomplete repeated measure to just include the DI at that particular wave |
| | Standard (single‐level) | SAS, SPSS, Stata, Mplus, R, Blimp | DI | Repeated measures imputed in wide format | Time‐varying nature of cluster memberships is ignored and only the DI for cluster membership at the first time point is used for all time points |
| | Standard (single‐level) | SAS, SPSS, Stata, Mplus, R, Blimp | DI | Repeated measures imputed in wide format | Time‐varying nature of cluster memberships is ignored and only the DI for the most common cluster membership over the course of the study is used for all time points |
| | Specialized for two levels | Mplus, R, Blimp | RE | Repeated measures imputed in wide format | Restricting the univariate imputation models specified for each incomplete repeated measure to just include the RE for the cluster at that particular wave |
| | Specialized for two levels | R, Realcom‐impute, Stat‐JR | DI | RE | Including the relevant DI representing the cluster membership at each time point in long format |
| | Specialized for two levels | R | DI | RE | Including the relevant DI representing the cluster membership at each time point in long format |
| | Specialized for three levels | R | RE | RE | Through time‐varying REs for the clusters (ie, using a series of univariate CCREMs for imputation in FCS) |
| | Specialized for three levels | R, Blimp | RE | RE | Time‐varying nature of cluster memberships is ignored and the RE for the cluster at the first time point is used at all time points |
| | Specialized for three levels | R, Blimp | RE | RE | Time‐varying nature of cluster memberships is ignored and the RE for the most common cluster membership over the course of the study is used at all time points |
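The last column of this table distinguishes three ways of treating cluster membership: using the first cluster, using the most common cluster, or using wave-specific dummy indicators (DI) alongside repeated measures in wide format. The Python sketch below illustrates those three derivations on a small made-up membership history. All names and data are hypothetical, and the tie-breaking rule for the most common cluster (earliest attended wins) is an assumption made here, not something stated in the source.

```python
from collections import Counter

# Hypothetical long-format records: (student_id, wave, school_id).
records = [
    (1, 1, "A"), (1, 2, "A"), (1, 3, "B"), (1, 4, "A"),
    (2, 1, "B"), (2, 2, "C"), (2, 3, "C"), (2, 4, "C"),
]


def histories(records):
    """Group records into {student_id: [(wave, school), ...]} ordered by wave."""
    by_student = {}
    for student, wave, school in sorted(records, key=lambda r: (r[0], r[1])):
        by_student.setdefault(student, []).append((wave, school))
    return by_student


def first_cluster(history):
    """Cluster at the first observed wave, to be used for all waves."""
    return history[0][1]


def most_common_cluster(history):
    """Most frequently attended cluster; ties go to the earliest attended (assumption)."""
    counts = Counter(school for _, school in history)
    top = max(counts.values())
    for _, school in history:
        if counts[school] == top:
            return school


def wave_specific_dummies(history, schools):
    """One wide-format row of 0/1 indicators per wave; the first school is the reference."""
    row = {}
    for wave, school in history:
        for candidate in schools[1:]:
            row[f"school_{candidate}_w{wave}"] = int(school == candidate)
    return row


by_student = histories(records)
schools = sorted({school for _, _, school in records})
student1_wide = wave_specific_dummies(by_student[1], schools)
```

In the wave-specific variant, the imputation model for a repeated measure at a given wave would include only the indicators for that wave (for example `school_B_w3` and `school_C_w3` for wave 3), which is how the time-varying membership is retained.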
Simulation scenarios
| | Base‐case | Smaller sample size | Higher number of waves | Small, constant set of clusters |
|---|---|---|---|---|
| Number of school clusters | 40 | 40 | 40 | 10 |
| School cluster sizes | varying (ranging from 8–66) | varying (ranging from 8–66) | varying (ranging from 8–66) | constant size of 120 |
| Total sample size | 1200 | 300 | 1200 | 1200 |
| Number of school clusters added at each wave | 10 | 10 | 10 | 10 |
| Number of waves of data collection | 4 | 4 | 8 | 4 |
FIGURE 1 Distribution of the deviations of the estimated regression coefficient of interest from its true value (β1, true value = −0.02) across the 1000 simulated datasets, from available case analysis (ACA) and the 11 multiple imputation (MI) approaches, under two scenarios with different numbers of higher‐level clusters and four ICC combinations, when data are missing at random with dependencies based on CATS (MAR‐CATS)
FIGURE 2 Distribution of the deviations of the estimated regression coefficient of interest from its true value across the 1000 simulated datasets, for available case analysis (ACA) and the 11 multiple imputation (MI) approaches, with (A) a sample size of 300 (40 schools of varying sizes at wave 1 and 10 new schools added at each additional wave of data collection; true value = −0.05), (B) 8 waves of data collection (40 school clusters of varying sizes at wave 1 and 10 new schools added at each additional wave of data collection; true value = −0.02), and (C) a small, constant set of clusters (ie, no additional clusters added at each wave; true value = −0.02), when data are missing at random with inflated dependencies (MAR‐inflated)
FIGURE 3 Empirical standard errors (filled circles, with error bars showing ±1.96 × Monte Carlo standard errors) and average model‐based standard errors (hollow circles) for the regression coefficient of interest from 1000 replications, for available case analysis (ACA) and the 11 multiple imputation (MI) approaches, under two scenarios with different numbers of higher‐level clusters and four ICC combinations, when data are missing at random with dependencies based on CATS (MAR‐CATS)
FIGURE 4 Empirical standard errors (filled circles, with error bars showing ±1.96 × Monte Carlo standard errors) and average model‐based standard errors (hollow circles) for the regression coefficient of interest from 1000 replications, for available case analysis (ACA) and the 11 multiple imputation (MI) approaches, with (A) a sample size of 300 (40 schools of varying sizes at wave 1 and 10 new schools added at each additional wave of data collection; true value = −0.05), (B) 8 waves of data collection (40 school clusters of varying sizes at wave 1 and 10 new schools added at each additional wave of data collection; true value = −0.02), and (C) a small, constant set of clusters (ie, no additional clusters added at each wave; true value = −0.02), when data are missing at random with inflated dependencies (MAR‐inflated)
FIGURE 5 Distribution of the deviations of the variance component estimates from their true values across the 1000 simulated datasets, from available case analysis (ACA) and the 11 multiple imputation (MI) approaches, under the simulation scenario with a higher number of higher‐level clusters (addition of 50 new clusters at each wave), when data are missing at random with dependencies based on CATS (MAR‐CATS)
FIGURE 6 Distribution of the deviations of the variance component estimates from their true values across the 1000 simulated datasets, from available case analysis (ACA) and the 11 multiple imputation (MI) approaches, under the simulation scenario with a lower number of higher‐level clusters (addition of 10 new clusters at each wave), when data are missing at random with dependencies based on CATS (MAR‐CATS)
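The figure captions refer to deviations from the true value, empirical standard errors, Monte Carlo standard errors, and average model-based standard errors. The Python sketch below shows one conventional way to compute these quantities from a set of simulation replications. The function name and the toy inputs are hypothetical, and the Monte Carlo standard error of the empirical SE uses the common approximation EmpSE / sqrt(2(n − 1)), which is assumed here rather than taken from the source.

```python
import math
import statistics


def performance(estimates, model_ses, true_value):
    """Summarize simulation replications for one method (illustrative sketch)."""
    n = len(estimates)
    bias = statistics.fmean(estimates) - true_value   # mean deviation from the truth
    emp_se = statistics.stdev(estimates)              # empirical SE across replications
    mcse_emp_se = emp_se / math.sqrt(2 * (n - 1))     # approximate Monte Carlo SE of emp_se
    avg_model_se = statistics.fmean(model_ses)        # average model-based SE
    error_bar = (emp_se - 1.96 * mcse_emp_se, emp_se + 1.96 * mcse_emp_se)
    return {
        "bias": bias,
        "emp_se": emp_se,
        "mcse_emp_se": mcse_emp_se,
        "avg_model_se": avg_model_se,
        "error_bar": error_bar,
    }


# Toy inputs (4 made-up replications instead of 1000), true value -0.02.
summary = performance(
    estimates=[-0.03, -0.01, -0.02, -0.02],
    model_ses=[0.008, 0.009, 0.008, 0.007],
    true_value=-0.02,
)
```

An average model-based SE that falls well outside the error bar around the empirical SE suggests that a method's variance estimate is miscalibrated.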
Point estimate (and standard error) for the effect of depressive symptoms at the previous wave on the teacher numeracy scores, and point estimates for the variance components at levels 3, 2, and 1, from available case analysis (ACA) and MI approaches applied to the CATS data
| Method | Regression coefficient estimate (SE) | Level 3 variance component | Level 2 variance component | Level 1 variance component |
|---|---|---|---|---|
| | −0.002 (0.007) | 0.005 | 0.229 | 0.276 |
| | *** | *** | *** | *** |
| | −0.004 (0.007) | 0.007 | 0.226 | 0.285 |
| | −0.005 (0.008) | 0.006 | 0.194 | 0.307 |
| | *** | *** | *** | *** |
| | −0.003 (0.007) | 0.008 | 0.229 | 0.287 |
| | −0.004 (0.007) | 0.003 | 0.228 | 0.281 |
| | −0.004 (0.007) | 0.009 | 0.227 | 0.286 |
| | −0.004 (0.007) | 0.009 | 0.227 | 0.286 |
| | *** | *** | *** | *** |
| | −0.002 (0.010) | 0.029 | 0.301 | 0.285 |
| | −0.005 (0.007) | 0.004 | 0.225 | 0.280 |
Note: ***All JM approaches produced implausible estimates and are therefore omitted from the table.
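Each MI row in this table is a single estimate and standard error obtained by combining the results from several imputed datasets. The source does not show that step, so the Python sketch below illustrates the standard combination via Rubin's rules for a scalar estimate. The function name and the per-imputation inputs are made up for illustration and are not CATS results; pooling of variance components is not covered.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool m per-imputation estimates and SEs for a scalar parameter (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = statistics.variance(estimates)                   # between-imputation variance
    total = within + (1 + 1 / m) * between                     # total variance
    return pooled_estimate, math.sqrt(total)


# Made-up results from m = 3 imputed datasets.
pooled_estimate, pooled_se = pool_rubin(
    estimates=[-0.004, -0.005, -0.003],
    std_errors=[0.007, 0.007, 0.007],
)
```

The pooled SE is never smaller than the root of the average within-imputation variance, because the between-imputation term adds the uncertainty due to the missing data.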