Literature DB >> 23113127

Multiple Imputation to Deal with Missing Clinical Data in Rheumatologic Surveys: an Application in the WHO-ILAR COPCORD Study in Iran.

M Mirmohammadkhani¹, A Rahimi Foroushani, F Davatchi, K Mohammad, A Jamshidi, A Tehrani Banihashemi, K Holakouie Naieni.

Abstract

BACKGROUND: The aim of the article is demonstrating an application of multiple imputation (MI) for handling missing clinical data in the setting of rheumatologic surveys using data derived from 10291 people participating in the first phase of the Community Oriented Program for Control of Rheumatic Disorders (COPCORD) in Iran.
METHODS: Five data subsets were produced from the original data set. Certain demographics were selected as complete variables. In each subset, we created a univariate pattern of missingness for knee osteoarthritis status as the outcome variable (disease) using different mechanisms and percentages. The crude disease proportion and its standard error were estimated separately for each complete data set to be used as true (baseline) values for percent bias calculation. The parameters of interest were also estimated for each incomplete data subset using two approaches to deal with missing data including complete case analysis (CCA) and MI with various imputation numbers. The two approaches were compared using appropriate analysis of variance.
RESULTS: With CCA, percent bias associated with missing data was 8.67 (95% CI: 7.81-9.53) for the proportion and 13.67 (95% CI: 12.60-14.74) for the standard error. However, they were 6.42 (95% CI: 5.56-7.29) and 10.04 (95% CI: 8.97-11.11), respectively using the MI method (M=15). Percent bias in estimating disease proportion and its standard error was significantly lower in missing data analysis using MI compared with CCA (P< 0.05).
CONCLUSION: To estimate the prevalence of rheumatic disorders such as knee osteoarthritis, applying MI using available demographics is superior to CCA.

Entities: Chemical Disease Gene Species

Keywords: COPCORD; Imputation; Missing Data; Osteoarthritis; Rheumatology

Year: 2012 PMID： 23113127 PMCID： PMC3481659

Source DB: PubMed Journal: Iran J Public Health ISSN： 2251-6085 Impact factor: 1.429

Introduction

Missing data is an unavoidable challenge in most epidemiologic researches and occurs under various mechanisms (1). Missing completely at random (MCAR) refers to a condition where missingness is not related to the studied variables. In missing at random (MAR), data is missing at random conditionally, and although unrelated to the variable of interest, it is related to other study variables. Missing not at random (MNAR) is the case where missingness depends on the values of the variable of interest (2–3). In cross-sectional surveys like any other type of observational studies, missing data can be due to incomplete responses and low rate of respondents’ cooperation (4–5). However, probability of missingness is not equal for all variables; those collected by methods that are less costly and less reliant on participant cooperation are also less likely to have missing data. For example, demographics can be collected through simple approaches which are less dependent on subject participation, while clinical data such disease status would at least require taking a medical history and performing physical exam, and in some cases, it may be possible only by utilizing expensive, invasive or time consuming diagnostic procedures as well as subject consent and participation in every stage. Rheumatologic studies also are not exempt. As a typical example, we can refer to the first phase of the Community Oriented Program for Control of Rheumatic Disorders (COPCORD) performed in Tehran the capital of Iran in 2005 by the Rheumatology Research Center of Tehran University of Medical Sciences in collaboration with the World Health Organization (WHO) and the International League of Associations for Rheumatology (ILAR) to determine the pattern of rheumatic complaints and disorders in this region. As the first step of data gathering procedure, a short preliminary interview was performed by trained health care providers to find eligible individuals in each random selected household considering their demographic characteristics. Then, selected participants were approached at their homes to gather main clinical data on their rheumatic complaints and disorders through verbal interview, and consenting participants had a physical exam and diagnostic tests by trained physicians and clinicians. In case they were absent from home, attempts were repeated for up to two more times before being excluded from the study. From 13741 eligible people, 582 individuals (4.23%) refused to participate and 2868 subjects (20.87%) were not reached despite multiple attempts. Eventually, we had data on demographics of 13741 participants, but clinical data on 3450 people (25.11%) was missing. Data was collected on the first or second attempt in 8401 cases (81.6%), and the third attempt for 1890 cases (18.4 %). A more detailed methodology is presented elsewhere (6–10). Considering the context of a study, percentage of missing data and the missing data mechanism influence the level of errors due to missingness. Fortunately, in many situations it is possible to reduce bias and provide more precise findings to deal with missing data using special methods and software (11). Some techniques are based on data repair in which the missing values are imputed with appropriate values based on observed data. In some imputing methods, the missing value is imputed with just one fitted value. These methods are classified as single imputation (SI). As an important drawback, no variation is assumed for the fitted value in SI, and this leads to an overestimated study precision (12). However in multiple imputation (MI), contrary to SI, fitted values are permitted to vary. MI is popular with researchers in various fields as a novel and efficient method (13). In brief, the method has three phases. In the first phase, the missing data are imputed M times to get M complete datasets. In the second phase, these datasets are analyzed separately, using methods of interests. In the final step, the results obtained from M analyses are combined to draw a single inference using certain rules (14). Flexibility and efficiency are the most prominent characteristics of MI which make it more favorite (15). Although MI has been utilized in many fields increasingly (16), the method is quite uncommon in some fields of epidemiology despite its advantages, especially in large data sets (5, 17–19). One question needs to be replied; in COPCORD ongoing studies, when the aim of the study is to determine the prevalence of musculoskeletal disorders, can we suggest using the MI as a novel imputing method instead of making inferences after excluding incomplete cases and running complete case analysis (CCA)? The objective of the present analysis was to contrast CCA and MI across three missing data mechanisms and different proportions of missing data in the context of COPCORD study.

Materials and Methods

We used the data of 10291 participants in the first phase of the COPCORD in Iran with complete demographic and clinical data. Presence or absence of knee osteoarthritis, as an important and typical rheumatic disease, was set as outcome variable subject to missingness. Of them, 1532 (14.89%) people were diagnosed with knee osteoarthritis. Age, sex, marital status, education level, and occupation were considered as variables with no missing value. Using all observations of whole data set, five independent data subsets were generated based on city zones (north, south, west, east and center) to make inferences (Table 1).

Table 1:

Population, Number of knee osteoarthritis cases and accepted true values of the parameters of interest (proportion and standard error) in each generated data subset

Data subset	Population (No. of Cases)	True values (V_t)
Data subset	Population (No. of Cases)	Proportion (%)	Standard error
1	768(103)	13.4	0.012
2	2965(469)	15.8	0.007
3	1809(271)	15.0	0.008
4	2911(445)	15.3	0.007
5	1838(244)	13.3	0.008
Total	10291(1532)	14.9	0.003

To create a univariate pattern of missing data, we deleted some entries for the outcome variable (diagnosis of knee osteoarthritis) in each database while all observations related to demographics were retained. In this step 10, 15, 20 and 25 percent of values related to disease status were dropped from the database of each data subset separately and independently following three mechanisms. 1) MCAR. 2) Random deletion of entries indicating absence of disease (with no knee osteoarthritis) assuming a lower participation rate for healthier people. This mechanism can be considered MNAR. 3) Random deletion of entries collected on the third (and second) attempts regarding to real situation of missing data in the COPCORD study context. We named this mechanism as non-response. Hence 60 incomplete data subsets were generated. The parameters of interest were the crude proportion of knee osteoarthritis and its standard error. In incomplete databases, estimates were calculated by two approaches; CCA which is done after deleting cases with missing values, and MI in which missing values were imputed M times (in our study 5, 10, 15, and 20 times) using all other observed values. Therefore, we had 5 complete data subsets and 60 with missing data. After dealing with missing values, we had 300 analyses, 60 of which were treated with CCA and 240 with MI where M ranged between 5 and 20. For MI, we utilized Stata’s ice command to perform multiple imputation by chained equations (MICE) which imputes missing values by using switching regression, an iterative multivariable regression technique. Estimated parameters from each complete database were set as accepted true values (Vt). For estimation of crude disease proportion and its standard error we used the Wald method for binomial distribution (presence or absence of knee osteoarthritis) (Table 1). We calculated the percent bias associated with missing data for disease proportion and its standard error in all 300 data subsets in which missingness was created. To measure the percent bias, we considered the absolute difference between the value obtained by analysis of each incomplete data subset (Vi) and the corresponding Vt expressed as a percentage of the true value (percent bias = 100 * [|Vi – Vt | / Vt ]). We compared the percent bias remaining after applying MI and CCA, as well as the effect of other factors of interest including the percentage of missing data (ranging between 5 and 25), the missing data mechanism (including MCAR, MNAR and non-response), and the interaction among them. For this purpose, we utilized analysis of variance (ANOVA) with general linear model (GLM) for repeated measures in the SPSS software version 16. The procedure provides analysis of variance when the same measurement is made several times on each subject in different periods or conditions. In our study, each one of the 300 databases was a special case of the main five generated data subsets. All tests were done at a confidence level of 95%, and Bonferroni correction was applied in multiple comparisons.

Results

In our analysis, the grand mean of percent bias associated with missing data was 6.94 (95% CI: 6.55–7.33) for the disease proportion and 10.99 (95% CI: 10.51–11.46) for the standard error. Figure 1 shows mean percent bias for the proportion (A and C) and standard error (B and D) with different missing data mechanisms (A and B) and missing data handling methods (C and D) separately for each data subset.

Fig. 1:

Percent bias for the proportion (A and C) and standard error (B and D) with different missing data mechanisms (A and B) and missing data handling methods (C and D) separately for each data subset.(MNAR=Missing not at random, MCAR=Missing completely at random, MI=Multiple Imputation, CCA= Complete case analysis)

The effect of missing data handling methods on percent bias using the ANOVA is summarized in Table 2. The percent bias in estimating the disease proportion was significantly affected in the data subsets (P<0.05), but not its standard error (P> 0.2). In estimating the proportion, the missing data mechanism (P=0.01) and missing data percentage (P<0.001) as well as their interaction (P<0.001) had significant effects on bias. In estimating the standard error, bias was significantly affected by missing data percentage (P<0.001) and its interaction with missing data mechanism (P<0.001), but not by the missing data mechanism per se (P=0.07). The missing data handling method significantly affected the bias in estimating both the proportion (P=0.02) and its standard error (P<0.001). Comparing the values of Partial Eta Squared revealed that in estimating the proportion, the interactive effect of missing data percentage and mechanism was greatest on the percent bias. But in estimating the standard error, the missing percentage was the most effective factor on percent bias.

Table 2:

Results of ANOVA to test the effects on the level of percent bias in estimating the parameters of interest (proportion and standard error)

Source of effects ^a	Proportion		Standard Error
Source of effects ^a	P	Partial Eta Squared	P	Partial Eta Squared
Data subset	<0.001	0.218	0.3	0.019
Data subset*Mechanism	0.001	0.243	0.3	0.046
Data subset*Method	0.007	0.239	0.3	0.088
Data subset*Percent	0.004	0.152	0.3	0.017
Data subsetMechanismPercent	0.003	0.212	0.3	0.050
Mechanism	0.01	0.164	0.07	0.097
Method	0.002	0.288	<0.001	0.397
Percent	<0.001	0.758	<0.001	0.869
Mechanism*Percent	<0.001	0.809	<0.001	0.378
Intercept	0.3	0.017	0.002	0.178

Lower-bound epsilon is used for adjustment to the numerator and denominator degrees of freedom in order to validate the univariate F statistic

Table 3 presents the percent bias and its confidence intervals for each missing data mechanism and for each handling method separately. The highest value pertained to the MNAR mechanism (16.56, 95% CI: 15.89–17.22 for proportion estimation and 14.42, 95% CI: 13.59–15.25 for standard error estimation) and the smallest value was with MCAR (1.99, 95% CI: 1.32–2.66 for proportion estimation and 8.96, 95% CI: 8.13–9.79 for standard error estimation). The value for non-response was between the above two (2.28, 95% CI: 1.61–2.95 for proportion estimation and 9.58, 95% CI: 8.75–10.41 for standard error estimation). As stated in Table 3, using CCA rather than MI to deal with missing data resulted in greater levels of percent bias. The highest percent bias values were with CCA (8.67, 95% CI: 7.81–9.53 for proportion estimation and 13.67, 95% CI: 12.60–14.74 for standard error estimation) and the smallest pertained to MI with M=15 (6.42, 95% CI: 5.56–7.29 for proportion estimation and 10.04, 95% CI: 8.97–11.11 for standard error estimation).

Table 3:

Percent bias and its 95% confidence interval in estimating the parameters of interest (proportion and standard error) regarding different missing data mechanisms and handling methods

Parameter	Method / Mechanism	Percent bias	95% Confidence Interval
Parameter	Method / Mechanism	Percent bias	Lower Bound	Upper Bound
Proportion	MI ^a (M=5)	6.515	5.651	7.379
	MI (M=10)	6.549	5.685	7.413
	MI (M=15)	6.424	5.561	7.288
	MI (M=20)	6.559	5.695	7.423
	CCA^b	8.670	7.807	9.534
	Non-response	2.283	1.614	2.952
	MNAR^c	16.556	15.886	17.225
	MCAR^d	1.992	1.323	2.661
Standard Error	MI (M=5)	10.338	9.268	11.408
	MI (M=10)	10.762	9.692	11.831
	MI (M=15)	10.042	8.972	11.112
	MI (M=20)	10.114	9.045	11.184
	CCA	13.673	12.603	14.743
	Non-response	9.578	8.750	10.407
	MNAR	14.420	13.591	15.249
	MCAR	8.959	8.131	9.788

Multiple Imputation,

Complete case analysis,

Missing not at random,

Missing completely at random

Table 4 presents the mean difference between CCA and MI with different imputation numbers (M) in terms of percent bias in estimating the parameters of interest; differences between the two approaches were statistically significant (P< 0.05) for every M in estimating both the proportion and the standard error. However no significant differences were found between various imputation numbers.

Table 4:

Mean difference in percent bias between CCA and MI with different imputation numbers (M) in estimating the parameters of interest (proportion and standard error)

Parameter	M	Difference in percent bias	P^a	95% Confidence Interval
Parameter	M	Difference in percent bias	P^a	Lower Bound	Upper Bound
Proportion	5	2.16	0.009	0.37	3.94
	10	2.12	0.010	0.33	3.91
	15	2.25	0.005	0.46	4.03
	20	2.11	0.01	0.32	3.90
Standard error	5	3.33	0.001	1.12	5.55
	10	2.91	0.003	0.70	5.12
	15	3.63	<0.001	1.42	5.84
	20	3.56	<0.001	1.35	5.77

Bonferroni adjustment

Table 5 demonstrates results of pairwise comparisons between percent biases for parameters of interest with different missing data mechanisms. Significant differences were found between MNAR and MCAR mechanisms in estimating both the proportion (P<0.001) and the standard error (P<0.001). In addition, there was a significant difference between MNAR mechanism and missing data due to non-response in estimating both parameters of interest (P<0.001), but not between MCAR and non-response (P>0.9).

Table 5:

Pairwise comparison of estimation percent bias for parameters of interest (proportion and standard error) as calculated with different missing data mechanisms

Parameter	I	J	Difference in percent bias (I-J)	P^a	95% Confidence Interval
Parameter	I	J	Difference in percent bias (I-J)	P^a	Lower Bound	Upper Bound
Proportion	Non-response	MNAR^b	−14.27	<0.001	−15.44	−13.11
		MCAR^c	0.29	1.00	−0.88	1.46
	MNAR	MCAR	14.56	<0.001	13.40	15.73
Standard Error	Non-response	MNAR	−4.84	<0.001	−6.29	−3.40
		MCAR	0.62	0.9	−0.83	2.06
	MNAR	MCAR	5.46	<0.001	4.01	6.91

Bonferroni adjustment,

Missing not at random,

Missing completely at random

Discussion

The main goal of performing research and statistical analysis of data is getting accurate and valid estimations about a population of interest. But, missing data often occurs and it may leave an undesirable impression on study results (20). If missing data can be assumed MCAR, ignoring it and running CCA can lead to loss of information, reduced sample size and diminished study precision (3). Subjects' unwillingness or lack of cooperation depends on numerous and complex causes which may associate with the study variables directly or indirectly. For example, in many situations, healthier individuals may be less inclined to participate or cooperate in a research. Therefore, the MCAR assumption is usually violated, and CCA not only reduces study precision, but also may lead to biased results. With MI, acceptable missing data mechanisms are MCAR or MAR (MAR assumption); otherwise it should be used and interpreted with caution (14). Since MAR assumption is not testable with observed data, it is potentially useful to consider perturbation (sensitivity) analysis of consequences of departure from the MAR assumption. The result of current analyses revealed that clinical missing data (diagnosis of knee osteoarthritis) can create bias in estimating both the crude proportion of disease and its standard error. The COP-CORD dataset had about 25% missing data; of the selected samples, 4.2% were unwilling to participate and 20.8% were unreachable despite 3 attempts. In the worst case scenario, if MNAR has to be assumed, running both CCA and MI can threaten study results by introducing serious bias. According to our findings, where missing data was mainly due to unreachability, the level of percent bias was not significantly different from that with MCAR. Therefore this finding suggests that clinical missing data due to unreachability should not be considered as MNAR, and thus applying MI is appropriate. However, the case is more complicated with those who refused to participate in the first place. Some reports suggest MI can provide acceptable results even when the missing data mechanism is MNAR, although its application requires expertise and caution (12, 20–24). Compared to CCA, our study results indicated significantly less percent bias with MI data analysis in all situations, and this is in agreement with abovementioned reports. In light of this observation, as a practical approach in settings similar to global COPCORD initiative, in which the objective is to determine the prevalence of musculoskeletal disorders such as osteoarthritis, we suggest making use of demographics such as age, gender, education, and occupation (which can be collected more convenient) to determine the disease status and applying MI rather than eliminating cases whose disease statuses are missing. It must be noted that applying MI may be questionable when the assumptions are violated, but it does not justify the use of CCA either. In other words, despite being more time consuming and complicated, MI seems to be a better choice than CCA even when we have MNAR data and expected biased estimates. The aim of MI is not making up data at all; on the contrary, it is making use of all observations to fix the database as far as possible. The more valid and appropriate the imputation and estimation model, the smaller the level of bias. In our study, the intrinsic correlation between knee osteoarthritis, as a typical musculoskeletal disorder, and any of the demographics can be a logical explanation for the ability of MI to reduce percent bias significantly, and validates its application. Based on the rules for MI, a minimum of 3 imputations are necessary before any interpretation of results, and theoretically, there is no upper limit. To be practical, most studies have used an M equal to 5 or 10 (25). In our study, the level of percent bias did not significantly decrease when we changed the number of imputations from 5 to 10, 15, or 20. MI is probably one of the most accurate and useful methods for handling missing data, nonetheless, it is not the best. In this study, we did not assess other missing data analysis methods, and thus, we can never recommend MI as the best approach for dealing with missing data in similar scenarios. However, MI is acceptably efficient and flexible, and considering the availability of user friendly statistical software, it seems only logical to use MI instead of the traditional CCA. We suggest comparing MI with other methods in this setting. We used a single clinical variable to avoid unwanted complexities. This issue influences the generalizability of study results. To minimize this effect, we chose a typical rheumatic disorder, i.e. knee osteoarthritis which can properly represent many musculoskeletal disorders in accordance with the analytic objectives of the study. We suggest conducting similar assessments with other variables in this setting. In summary, in a study setting similar to ours, neglecting clinical missing data can be a source of significant bias in estimating the prevalence of musculoskeletal disease such as knee osteoarthritis. A suitable option to reduce the impact of this issue and increase the accuracy of estimates is the use of MI to repair missing clinical data based on demographics, and it is a better choice than CCA.

Ethical considerations

Ethical issues (Including plagiarism, Informed Consent, misconduct, data fabrication and/or falsification, double publication and/or submission, redundancy, etc) have been completely observed by the authors.

23 in total

Review 1. WHO-ILAR COPCORD perspectives past, present, and future.

Authors: John Darmawan; Kenneth D Muirden
Journal: J Rheumatol Date: 2003-11 Impact factor: 4.666

Review 2. [COPCORD studies].

Authors: Rachel Riera; Rozana Mesquita Ciconelli; Marcos Bosi Ferraz
Journal: Acta Reumatol Port Date: 2006 Apr-Jun Impact factor: 1.290

3. How many imputations are really needed? Some practical clarifications of multiple imputation theory.

Authors: John W Graham; Allison E Olchowski; Tamika D Gilreath
Journal: Prev Sci Date: 2007-06-05

4. Multiple imputation: current perspectives.

Authors: Michael G Kenward; James Carpenter
Journal: Stat Methods Med Res Date: 2007-06 Impact factor: 3.021

5. [Multiple imputation for missing data: a brief introduction].

Authors: Michela Baccini
Journal: Epidemiol Prev Date: 2008 May-Jun Impact factor: 1.901

6. Multiple imputation as a missing data machine.

Authors: J Brand; S van Buuren; E M van Mulligen; T Timmers; E Gelsema
Journal: Proc Annu Symp Comput Appl Med Care Date: 1994

7. Issues in multiple imputation of missing data for large general practice clinical databases.

Authors: Louise Marston; James R Carpenter; Kate R Walters; Richard W Morris; Irwin Nazareth; Irene Petersen
Journal: Pharmacoepidemiol Drug Saf Date: 2010-06 Impact factor: 2.890

8. Multiple imputation in a large-scale complex survey: a practical guide.

Authors: Y He; A M Zaslavsky; M B Landrum; D P Harrington; P Catalano
Journal: Stat Methods Med Res Date: 2009-08-04 Impact factor: 3.021

9. WHO-ILAR COPCORD Study (Stage 1, Urban Study) in Iran.

Authors: Fereydoun Davatchi; Ahmad-Reza Jamshidi; Arash Tehrani Banihashemi; Jaleh Gholami; Mohammad Hossein Forouzanfar; Massoomeh Akhlaghi; Mojgan Barghamdi; Elham Noorolahzadeh; Ali-Reza Khabazi; Mansoor Salesi; Amir-Hossein Salari; Mansoor Karimifar; Kamal Essalat-Manesh; Mehrzad Hajialiloo; Mohsen Soroosh; Farhad Farzad; Hamid-Reza Moussavi; Farideh Samadi; Koorosh Ghaznavi; Homa Asgharifard; Amir-Hossein Zangiabadi; Farhad Shahram; Abdolhadi Nadji; Mahmood Akbarian; Farhad Gharibdoost
Journal: J Rheumatol Date: 2008-05-01 Impact factor: 4.666

10. A comparison of imputation techniques for handling missing predictor values in a risk model with a binary outcome.

Authors: Gareth Ambler; Rumana Z Omar; Patrick Royston
Journal: Stat Methods Med Res Date: 2007-06 Impact factor: 3.021

1 in total

1. Evaluation of Multiple Imputation with Large Proportions of Missing Data: How Much Is Too Much?

Authors: Jin Hyuk Lee; J Charles Huber
Journal: Iran J Public Health Date: 2021-07 Impact factor: 1.429

1 in total