Literature DB >> 27454392

Nearest neighbor imputation algorithms: a critical evaluation.

Lorenzo Beretta1, Alessandro Santaniello2.   

Abstract

BACKGROUND: Nearest neighbor (NN) imputation algorithms are efficient methods to fill in missing data where each missing value on some records is replaced by a value obtained from related cases in the whole set of records. Besides the capability to substitute the missing data with plausible values that are as close as possible to the true value, imputation algorithms should preserve the original data structure and avoid to distort the distribution of the imputed variable. Despite the efficiency of NN algorithms little is known about the effect of these methods on data structure.
METHODS: Simulation on synthetic datasets with different patterns and degrees of missingness were conducted to evaluate the performance of NN with one single neighbor (1NN) and with k neighbors without (kNN) or with weighting (wkNN) in the context of different learning frameworks: plain set, reduced set after ReliefF filtering, bagging, random choice of attributes, bagging combined with random choice of attributes (Random-Forest-like method).
RESULTS: Whatever the framework, kNN usually outperformed 1NN in terms of precision of imputation and reduced errors in inferential statistics, 1NN was however the only method capable of preserving the data structure and data were distorted even when small values of k neighbors were considered; distortion was more severe for resampling schemas.
CONCLUSIONS: The use of three neighbors in conjunction with ReliefF seems to provide the best trade-off between imputation error and preservation of the data structure. The very same conclusions can be drawn when imputation experiments were conducted on the single proton emission computed tomography (SPECTF) heart dataset after introduction of missing data completely at random.

Entities:  

Mesh:

Year:  2016        PMID: 27454392      PMCID: PMC4959387          DOI: 10.1186/s12911-016-0318-z

Source DB:  PubMed          Journal:  BMC Med Inform Decis Mak        ISSN: 1472-6947            Impact factor:   2.796


Background

The occurrence of missing data is a major concern in machine learning and correlated areas, including medical domains. As the quality of knowledge extracted from data is largely dependent from the quality of data, records with missing values may have a significant impact on descriptive and inferential statistics as well as on predictive analytics. Missing data can be handled in different ways, but simply ignoring them via deletion methods (e.g. listiwise deltetion) can be an inappropriate choice under many circumstances and besides a general loss of power this may lead to biased estimates of the investigated associations [1, 2]. The replacement of missing values with plausible values derived from the observation of a dataset via imputation procedures is, in most cases, a far better and valuable solution. Several methods exists for imputing missing data [2-5], among the most popular there are the so-called hot deck imputation methods, that in their deterministic form include the “nearest neighbour” (NN) imputation procedure [6]. In the hot-deck imputation methods, missing values of cases with missing data (recipients) are replaced by values extracted from cases (donors) that are similar to the recipient with respect to observed characteristics. NN imputation approaches are donor-based methods where the imputed value is either a value that was actually measured for another record in a database (1-NN) or the average of measured values from k records (kNN). The most notable characteristics ok NN imputation are: a) imputed values are actually occurring values and not constructed values, b) NN makes use of auxiliary information provided by the x-values, preserving thus the original structure of the data and c) NN is fully non parametric and does not require explicit models to relate y and x, being thus less prone to model misspecification. Several studies showed that NN may be superior over other hot-deck methods even though results may be dependent from the choice of the metric used to gauge the similarity or the dissimilarity of recipients to donors [7]. Irrelevant or noisy features add random perturbations to the distance measure and hurt performance so that, for instance, points in high dimensional space belonging to the same class (in classification problems) or to the same cluster (in unsupervised clustering applications) have low similarity [8, 9]. The choice of different similarity measures may partially address this issue, but ultimately does not solve the problem [8, 10]. Several methods have been proposed to accommodate the noise and/or to ameliorate the performance of NN algorithms in classification problems, straightforwardly these methods have been applied in imputation problems as well. The use of several k neighbors is a first attempt to control noise and it is widely accepted that small value of k have high influence on the results. kNN proved effective in imputing microarray data with an increased performance, as assessed by the normalized root mean squared error (RMSE), when k is > 1 [11]. In the nearest-variable procedure (kNN-V) and variants (kNN-H and kNN-A) described in [12] k relevant features are selected with respect to the variable with missing values by means of statistical correlation measures; evaluation in real-life and synthetic datasets by means of the RMSE showed a good performance of this method with respect to the selection of neighbors based on intra-subject distance. Other methods that have been proposed to improve the performance of NN to decorrelate error and wade through the noise in classification problems include the use of multiple NN classifiers. In [13] multiple NN classifiers based on random subsets of features are used and the performance of this ensemble was less prone to corruption by irrelevant features as compared to 1-NN or to kNN. Albeit NN is traditionally considered a stable, with low-variance, algorithm that could be not improved by other resampling techniques, such as bagging [14], other experiments indicate that bagging can actually improve the performance of NN provided that the resampling size is adequately below a minimum threshold [15]. Despite these premises, the performance of ensemble methods for NN imputation has not been assessed so far. A critical and often overlooked point in the evaluation of imputation methods, is the effect the imputed datum has on data structure and on the consequent risk of distorting estimates, standard errors and hypothesis tests despite an apparent good performance on other quality metrics. Imputed data are thus not necessarily more useful or usable and while there are situations where changing the distribution is not of concern, in many cases changing the underlying distribution has a relevant impact on decision making. In the present paper we assessed the performance of the NN algorithm and modifications in synthetic as well as in real-life datasets, quantifying the effect imputation yields on the data structure and on inferential and predictive statistics.

Methods

Frameworks for imputation

NN algorithms are similarity-based methods that rely on distance metrics and results may change in relation to the similarity measure used to evaluate the distance between recipients and donors. In our work, we used the Minkowski norm as metric to evaluate distance: The Minkowski norm assumes the form of the Euclidean or L2 distance when p = 2 or the form of the Manhattan (city-block) distance when p = 1; other fractional norms for p < 1 have been described [8]. In our experiments we set p = 0.5, 1 or 2. In the present paper we only focused on imputation problems with continuous or dichotomous variables, hence there was no need to consider other similarity measures for categorical or ordinal data as described in [16]. Three main NN variants were used for evaluation: a) 1-NN, with one donor selected per recipient, b) kNN with k = 5 donors per recipient and, c) weighted wKNN method with k = 5 and weighting in relation to the distance of the full set of donors to the recipient as described by Troyanskaya et al [11]. The three NN algorithms were then applied to these frameworks: Plain NN framework: the full set of data is used according to the hot-deck method and only complete cases with no missing data C(X) are considered as donors. Filtered NN framework: before imputation of the recipient X, the full set with no missing data C(X) is filtered to select a subset of features relevant to the missing variable to be imputed (X). To this end, C(X) is considered as a dataset in the context of a regression problem, where the variable with the missing datum (X) is set as the class variable and the other q variables (X, X, …, X) as predictors. Since in real-life situations there is usually no clue as to whether any relation exists between predictors and outcome or, if this relation exists, what form it takes, a fully non-parametric selection algorithm is considered an appropriate choice. In the present context, we applied the RReliefF algorithm described in [17]; the set is then filtered to select a subset C(X) ⊂ C(X) where (X, X, …, X) ⊂ (X, X, …, X) and s < q. In the present context we set the number of neighbors for RReliefF equal to 10 and set s as 10 %, 20 % or 30 % of q. As C(X) is invariant to X, the filtering step is performed only once before the NN imputation step that, on the contrary is performed separately for each X. Bagged NN framework: for each recipient X, from the set of donors with no missing data C(X) with size m, a random subset of donors with size m  < < m is selected with replacement such as C(X) ⊂ C(X); the subset C(X) is then used to impute the missing value for the recipient. The procedure is repeated n number of times with different random subsets of donors so that several possible values XX…X are derived from each C(X), C(X),…, C(X) random subsets; the final imputed values for X is calculated as the arithmetic mean of the X imputed values. In our experimental setup m is set to 10 % of m and n to 50 random runs. The random procedure of bagging, provided that m/m → 0 [15], is expected to be helpful in a dataset with several noisy variables that affect the evaluation of the distance from the recipient X to the donors. As a result of noise, cases that are dissimilar from X with respect to the attributes correlated with the missing variable to be imputed (X), are factiously selected as “close” to the recipient when they actually lie “far” from it. Bagging the donors with the abovementioned constrains, would help to eliminate such donors and to wade through the noise generated by irrelevant features correcting in part the overly-strong simplicity bias in the NN learner because NN is making incorrect assumptions about the domain and error is reduced by changing it [18]. Random NN framework: for each recipient X a random subset of attributes l such as that l = is selected and the corresponding set of donors with no missing data in the l attributes C(X) ⊂ C(X) is then considered. The procedure is repeated n number of times with different random subsets of attributes so that several possible values XX…X are derived from each C(X), C(X),…, C(X) subsets; the final imputed values for X is calculated as the arithmetic mean of the X imputed values. In our experimental setup n is set to 50 random runs. Overall, this frameworks differs from c) in that randomization is introduced in the selection of attributes rather in the selection of donors. The repeated random selection of attributes would for each random run favour the removal of irrelevant attributes that may bias the distance metric from recipient to donors and thus lead to unreliable proximities. The procedure is expected to have a high chance of success when in a dataset irrelevant (noisy) attributes outnumber the non-noisy attributes correlated with the missing variable to be imputed (X) as the odds are in favour of keeping non-noisy variables after the random selection of few attributes. During each random run we do expect to partially control the noise so that it is accommodated to a meaningful extent by the ensemble as the “true” imputation values derived from informative voters outnumber the “random” or “false” imputation values derived from non-informative voters [13]. Full Randomized NN framework: the procedures described in c) and d) are combined together in a random forest-like fashion [19]. This way a double randomization is introduced in the learning algorithm: firstly a random subset of l = attributes is selected from the set of donors with no missing data C(X) so that C(X) ⊂ C(X) and then a random subset of donors with size m  < < m is selected with replacement from C(X) to generate a fully randomized subset C(X) ⊂ C(X) ⊂ C(X). The procedure is repeated 50 times and, as in c) and d), the final imputed value is the arithmetic mean of values derived from the 50 random runs. The five frameworks used to evaluate the NN algorithms were tested against a mean imputation method where the missing value is replaced by the variable mean of complete cases.

Simulated datasets

To evaluate the performance of the imputation algorithms described above, we generated simulated dataset with a known pattern of missing at random data. In the present context we considered both missing completely at random (MCAR) and missing at random (MAR) patterns of randomness [4]. The first refers to the case when the probability of an instance (case) having a missing value for an attribute does not depend on either the known values or the missing data; the latter refers to the case when the probability of an instance having a missing value for an attribute depends on observed values, but not on the value of the missing data itself. To generate the simulated dataset we firstly created a population of 1*106 individuals with well-known dependencies among attributes and with a linear relationship with a variable Y such that: where β0 = log [2], β1 = log [4], β2 = log (0.5) and β3 = log (0.25) and ε is a random error normally distributed with μ = 40 and ϭ = 4. X and X are drawn from a multinormal random distribution that encompasses another variable X and whose parameters are: X, μ = 200 and ϭ = 40, X, μ = 50 and ϭ = 10, X, μ = 100 and ϭ = 20, Xby X ρ = 0.75, Xby X ρ = 0.6, Xby X ρ = 0.5. X is drawn from a multinormal random distribution with one variable X and whose parameters are: X, μ = 20 and ϭ = 4, X, μ = 500 and ϭ = 100, Xby X ρ = 0.5. X is drawn from a non-central chi-square distribution with μ = 20 and 3 degree of freedom. Two additional variables that are used to generate MAR data were then added to the population. X is drawn from a Bernoulli distribution with p = 0.5 and X is drawn from a uniform distribution in the range [30, 60]. Fifty additional variables drawn from different distributions were finally added to the dataset to generate a random noise and whose setting parameters were randomly generated from uniform distributions. These comprise: 15 variables drawn from a non-central chi square distribution with μ = [10, 50] and degree of freedom = [2, 5], 20 normally-distributed variables with μ = [10, 200] and ϭ = μ/5, 3 blocks of 3 variables drawn from a multinormal distribution with μ = [10, 100] and ϭ = [μ/5, μ/2] and 3 blocks of 2 variables drawn from a multinormal distribution with μ = [10, 100] and ϭ = [μ/5, μ/2]. From the population of 1*106 individuals, 400 cases were randomly drawn. Randomness was then introduced for variables X, X and X, so that each variable had 15 % or 30 % of missing cases. Missingness for X and X was introduced with a MCAR schema removing the desired number of cases in relation to a Bernoulli distribution with p = 0.15 or 0.30. Missingness for X was introduced according to a MAR schema so that it was dependent from X (binomial with equal probability) and X (uniform in the range [30, 60]). Four dummy categories were created, A: X = 0 and X < 60th percentile of [30, 60], B: X = 1 and X < 60th percentile of [30, 60], C: X = 0 and X ≥ 60th percentile of [30, 60], D: A: X = 1 and X ≥ 60th percentile of [30, 60]; each category was given a different risk of having missing cases so that Risk (A) = 1, Risk (B) = 1.5 * Risk (A), Risk (C) = 1.5 * Risk (B) and Risk (D) = 1.5 * Risk (C). The sampling procedure was repeated 500 times for each sample size/percentage of missingness.

Evaluation of results

After imputation, several estimators from the sampled sets can be calculated and compared with the true parameters θ observed in the whole population of 1*106 individuals. From these we can calculate the Bias the variance, Var and the mean squared error, MSE which are defined as follows: Where X is the measure of interest, more specifically we considered: The regression coefficients for the variables with missing values in equation (1) as calculated by the least squared method: β0, β1 and β2. The correlation between expected Ŷ values calculated inserting the derived regression coefficients and the values of X, X, X after imputation into equation (1) and the true Y values. The mean of X, X and X. The standard deviation of X, X and X. Additionally, we provided a measure of inaccuracy of imputation defined as the mean of the proportional difference between true and imputed values in the n variables with missing values, n:

Real-life dataset

The performance of the different learning frameworks was finally tested in a real-life dataset. To this end, we considered the single proton emission computed tomography (SPECT)-F heart dataset described by Kurgan et al [20] (available at: https://archive.ics.uci.edu/ml/datasets/SPECTF+Heart). The dataset describes SPECT data in 267 patients with suspected coronary artery disease and comprises a continous outcome and 44 continuous attributes from SPECT images in 22 regions of interest at rest and after stress, respectively termed F1R, F2R,…, F22R and F1S, F2S,…, F22S. To mirror the evaluation procedure described for synthetic datasets, we firstly determined the regression coefficients of a binary logistic regression equation modelled using a stepwise entry method (entry p = 0.01, exit p = 0.05). The final model thus obtained was as follow: Y = 44.133 – 0.94 * F5S - 0.123 * F13S - 0.207 * F16S - 0.201 * F20S. A 15 % of random missingness was then introduced in the 4 variables included in the regression equation via the MCAR schema. The MSE was chosen as the primary performance measure to evaluate the ability of the different learning frameworks to impute missing values. To this end we considered a) the ability to correctly infer the regression coefficients for the variables of interest, b) the inaccuracy in the imputed values, c) the distortion of data as assessed by the standard deviations of the 4 variables. Overall the MCAR/imputation procedure was repeated 500 times; only the wKNN method was used to learn the imputed values and k was set equal to 1, 3 or 10 neighbors. To summarize the performance of the different learning frameworks and to establish the trade-off between inaccuracy of imputation and distortion of data we proceeded as described hereafter. The mathematical mean between the normalized inaccuracy values (across the different frameworks) and the normalized standard deviation values (across the different frameworks) was calculated for each variable and ranked from the lowest (the best) to the highest (the worst). The average rank of the four variables was then ranked to produce a readily interpretable summary value.

Results

Our analysis was conducted to individuate the NN procedure that, at the same time, produced the best imputation performance with the minor distortion in the distribution of data. Under this view, all the distance metrics we considered produced the same results and the same conclusions can be drawn for the different p values we tested; for sake of brevity hereafter we’ll only show the results obtained with the Euclidean distance. Table 1 summarizes the effect of the different NN algorithms/learning frameworks on the regression coefficients for the variable of interest Y. kNN algorithms had the overall best performance as assessed by the MSE; results are independent from the mechanism of randomness and can be observed both for MAR (β0) and MCAR (β1 and β2) data. Among the different frameworks, the best results were obtained when noisy variable were filtered via RReliefF. The results obtained by resampling methods seem to be independent from the number of neighbors in the kNN algorithm and usually perform slightly worse than less computationally expensive methods.
Table 1

Regression Coefficients, average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data

β0 β1 β2
Missing15 %30 %15 %30 %15 %30 %
FrameworkMethodBiasVarMSEBiasVarMSEBiasVarMSEBiasVarMSEBiasVarMSEBiasVarMSE
Plain 1NN 0.18990.00120.03730.32770.00140.10880.19150.02650.06310.34960.04640.1685-0.10650.00580.0171-0.21560.00840.0549
kNN 0.09270.00050.00910.18170.00090.03390.0450.01810.02010.06380.04250.0465-0.02250.00390.0044-0.05890.00760.011
wkNN 0.09250.00050.00910.18110.00090.03370.04510.01810.02010.06390.04250.0465-0.02270.00390.0044-0.05940.00760.0111
RReliefF10 1NN 0.1580.00170.02670.30250.0030.09440.19010.02520.06130.36720.04150.1763-0.10030.00460.0146-0.20830.00820.0516
kNN 0.07370.0010.00640.15870.00250.02770.04620.01760.01970.08240.03840.0451-0.02270.00380.0043-0.06020.00730.0109
wkNN 0.07340.0010.00640.15840.00250.02760.04630.01750.01960.08270.03840.0452-0.02280.00370.0042-0.06020.00730.0109
RReliefF20 1NN 0.16040.00120.02690.2980.00230.09110.18030.02410.05660.35940.04430.1734-0.09760.00470.0142-0.20970.00730.0513
kNN 0.07150.00070.00580.1530.00170.02510.04070.01740.0190.07010.03990.0447-0.02080.00360.0041-0.06070.00720.0108
wkNN 0.07120.00070.00570.15240.00170.02490.04080.01730.01890.07070.03990.0448-0.02090.00360.0041-0.06090.00710.0108
RReliefF30 1NN 0.16170.00120.02740.29820.00180.09080.1830.02210.05550.34680.04380.164-0.09860.00490.0147-0.20050.00810.0483
kNN 0.0730.00060.00590.15330.00130.02480.03830.01550.0170.06330.03870.0426-0.02010.00360.004-0.05650.00680.01
wkNN 0.07270.00060.00580.15270.00130.02460.03840.01550.0170.0640.03880.0428-0.02020.00360.004-0.05650.00680.01
Bagging 1NN 0.08120.00040.0070.15930.00080.02620.01820.01760.01790.0020.03940.0393-0.0070.00380.0039-0.02440.00730.0079
kNN 0.08690.00040.0080.1680.00080.0290.00950.01760.0176-0.0150.03940.0395-0.00060.00390.0039-0.00980.00740.0075
wkNN 0.08570.00040.00780.16510.00080.0280.00960.01750.0176-0.01530.03920.0394-0.00080.00390.0039-0.01040.00730.0074
Random 1NN 0.08740.00040.00810.15890.00070.02590.01210.0180.0181-0.01140.03870.0388-0.00030.00410.0041-0.01220.00740.0076
kNN 0.08720.00040.0080.15720.00070.02540.00810.01770.0177-0.02120.03850.03890.00140.0040.004-0.0070.00740.0074
wkNN 0.08720.00040.0080.15710.00070.02530.00810.01770.0177-0.02120.03850.03890.00140.0040.004-0.0070.00740.0074
Bagging + Random 1NN 0.09390.00050.00930.16860.00070.02910.01280.01870.0188-0.01090.03880.0388-0.00040.00410.0041-0.01010.00740.0075
kNN 0.09560.00050.00960.17150.00070.03010.01030.01850.0186-0.01780.03930.03950.0010.00410.0041-0.0060.00750.0075
wkNN 0.09530.00050.00950.17080.00070.02990.01020.01850.0185-0.01790.03920.03940.00110.00410.0041-0.0060.00750.0075
Mean Imputation0.1090.00050.01240.19130.00080.03730.01080.01960.0197-0.01470.04070.04090.00150.00440.0043-0.00470.00780.0078
Regression Coefficients, average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data When the correlation between estimated and true values of the dependent value Y were considered, the very same conclusions can be drawn (Table 2). The general good performance of kNN algorithms in the context of inferential statistics with the plain or filtering framework, is justified by the higher degree of accuracy (e.g. by the lower inaccuracy) these methods have compared to the competitors in imputing the missing value (Table 3). As expected, the inaccuracy of imputation for X was lower than the inaccuracy observed for X due to the higher number of dependencies in the simulated datasets; accordingly, X that was completely unrelated to other variables and had a non-normal (chi-square) distribution, had the highest degree of inaccuracy in the imputed values.
Table 2

Correlation between expected and actual values of the dependent variable Y as calculated from equation [1], average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data

Estimated vs actual Y
Missing15 %30 %
FrameworkMethodBiasVarMSEBiasVarMSE
Plain 1NN 0.169570.000720.029470.309590.001140.09698
kNN 0.109610.000350.012360.219870.000680.04902
wkNN 0.109470.000350.012330.219580.000680.04890
RReliefF10 1NN 0.151700.000750.023770.296080.001630.08928
kNN 0.100360.000420.010490.209630.001040.04498
wkNN 0.100230.000410.010460.209470.001040.04492
RReliefF20 1NN 0.152460.000630.023870.293310.001360.08738
kNN 0.098500.000350.010050.206070.000870.04333
wkNN 0.098390.000350.010020.205800.000870.04322
RReliefF30 1NN 0.152020.000680.023790.290810.001220.08579
kNN 0.098930.000330.010120.205080.000790.04285
wkNN 0.098770.000330.010080.204740.000790.04271
Bagging 1NN 0.102870.000300.010880.207560.000630.04370
kNN 0.106080.000300.011560.212400.000610.04572
wkNN 0.105440.000300.011420.210780.000610.04503
Random 1NN 0.106290.000300.011600.207380.000590.04359
kNN 0.106260.000300.011590.206380.000580.04317
wkNN 0.106220.000300.011580.206310.000580.04314
Bagging + Random 1NN 0.110100.000310.012430.212580.000600.04579
kNN 0.111010.000320.012640.214220.000600.04649
wkNN 0.110830.000320.012600.213860.000600.04633
Mean Imputation0.118570.000350.014410.225120.000630.05130
Table 3

Inaccuracy in the imputation of missing variables, average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data

Inaccuracy of imputed values
Missing15 %30 %
FrameworkMethodX0 X1 X 2 X0 X1 X 2
Plain 1NN 0.205130.229720.565890.207400.231130.56644
kNN 0.161350.180780.458040.164670.182520.46226
wkNN 0.161190.180720.458140.164430.182460.46245
RReliefF10 1NN 0.185850.226630.563610.194240.232590.56324
kNN 0.148610.179710.456340.154610.183050.46181
wkNN 0.148460.179680.456430.154470.183050.46184
RReliefF20 1NN 0.188080.224160.561860.192470.228910.56218
kNN 0.148470.177850.454500.153130.181810.45896
wkNN 0.148310.177810.454630.152910.181790.45918
RReliefF30 1NN 0.188350.224930.558070.193710.228060.55949
kNN 0.149180.177700.453810.153770.180580.45964
wkNN 0.148990.177670.453940.153530.180560.45974
Bagging 1NN 0.157930.172420.439020.161490.175390.44266
kNN 0.162580.171340.431480.166700.174580.43561
wkNN 0.161900.171190.431620.165580.174310.43583
Random 1NN 0.162440.171640.431410.162530.174000.43639
kNN 0.162990.170990.429320.162980.173070.43380
wkNN 0.162940.170970.429320.162920.173060.43381
Bagging + Random 1NN 0.166350.172870.431590.166360.174900.43593
kNN 0.167850.172380.429070.168230.174360.43371
wkNN 0.167650.172330.429070.167990.174290.43371
Mean Imputation0.175760.174120.428190.175600.176160.43270
Correlation between expected and actual values of the dependent variable Y as calculated from equation [1], average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data Inaccuracy in the imputation of missing variables, average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data Even if kNN and the other complex methods seem to have a superior performance in inferential statistics as compared to simple 1NN (or 1NN after filtering), these methods caused a not irrelevant distortion of data. The means of the imputed variables were not differentially affected by these methods (results not shown), yet standard deviations were greatly influenced by complexity (Table 4). The detrimental effect on standard deviation is due to a general “flattening” around the mean of the imputed values; it is noteworthy to observe that for resampling schemas the imputed value is indeed very similar to the value obtained by the mean imputation method. The very same effect can be observed with a different degree of magnitude for kNN. Figure 1 plots the distribution of X values in absence of missingness and after imputation with k = 1, 3 or 10 neighbors in an additional experiment of 100 imputation runs in samples of size n = 400, MCAR = 30 % in the context of the plain framework with the kNN algorithm. As it can be clearly observed, only 1NN preserves the original variability in the distribution of data while the distribution of X is gradually distorted as the number of k increases. To evaluate the trade-off between inferential statistics and distortion of data we next plotted in Fig. 2 the inaccuracy of imputation vs the MSE of the standard deviation of the mean. As it can be observed, the inaccuracy of imputation decreases as the number of neighbors increases, yet this causes a gradual increase in the MSE of the standard deviation due to an unwanted reduction in the original dispersion of data. The best trade-off between inaccuracy and preservation of data structure, that is the average between normalized inaccuracy and normalized MSE of the standard deviation, is the point where the two curves intersect and corresponds to 3 neighbors. The very same optimal k point could be obtained re-running the experiments in samples with larger sizes (n = 1600) or with the filtered NN framework.
Table 4

Standard deviation of the mean for the imputed variables, average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data

X0 X1 X 2
Missing15 %30 %15 %30 %15 %30 %
FrameworkMethodBiasVarMSEBiasVarMSEBiasVarMSEBiasVarMSEBiasVarMSEBiasVarMSE
Plain 1NN 0.50742.37752.63021.00863.56074.57080.02320.02790.02840.0430.04230.0441-0.00220.19520.19480.04870.33960.3413
kNN 2.4551.97127.99435.11692.259228.43710.25750.02220.08840.5230.02380.29720.57010.14690.47161.24260.16431.708
wkNN 2.45281.97047.98265.11142.262628.38440.25730.02220.08830.52250.02380.29670.56980.14690.47131.24120.16461.7049
RReliefF10 1NN 0.43932.40042.58850.7613.32073.89310.02190.02760.02810.03160.03660.03750.0150.17090.17070.09960.25140.2608
kNN 2.09912.06236.46424.51122.53322.87880.25060.02280.08560.50980.02490.28480.57370.14490.47381.24860.15751.7162
wkNN 2.09462.06336.44644.49872.535722.76870.25040.02280.08550.50910.0250.28410.57350.14480.47341.24760.15771.7138
RReliefF20 1NN 0.56332.28922.60190.9943.25274.23430.03210.02610.02710.05110.03540.03790.02790.17860.1790.10110.26490.2746
kNN 2.23152.02256.9984.722.314624.58860.25410.02220.08670.5150.02460.28980.5760.14370.47521.25590.15631.7334
wkNN 2.22752.02196.97964.70882.317824.48590.25390.02220.08660.51430.02460.28910.57570.14370.47481.25450.15641.7297
RReliefF30 1NN 0.62132.35582.73711.08263.4174.58210.03210.02520.02620.06060.03670.04020.0290.17810.17860.14710.26470.2858
kNN 2.28912.00227.23824.83522.379325.75410.25570.02240.08780.51930.0240.29360.5780.14550.47931.25880.1551.7392
wkNN 2.28632.00327.22634.82562.379425.66060.25550.02240.08760.51860.0240.29290.57760.14540.47871.25740.15511.7358
Bagging 1NN 2.83711.9269.97155.96962.086537.71850.29940.02190.11150.61240.02260.39760.66780.14380.58951.45320.15152.2629
kNN 3.02061.907611.02796.3632.018142.50120.31640.02180.12190.64780.02220.44180.7060.14320.64131.53540.14732.5044
wkNN 3.01411.908410.98926.3422.022942.23910.3160.02180.12160.64640.02220.440.70510.14320.64011.53220.14762.4949
Random 1NN 2.98561.907910.81776.24092.029740.97380.31320.02180.11990.63860.02220.42990.69840.14250.631.5160.14722.4451
kNN 3.04771.903211.1886.37972.014242.71080.31930.02180.12370.6520.02220.44720.71230.14260.64971.54720.14662.5401
wkNN 3.04731.903211.18546.37872.014342.69790.31930.02180.12370.65190.02220.44720.71230.14260.64961.54710.14662.5397
Bagging + Random 1NN 3.01441.906810.98956.30792.016341.80150.31490.02180.1210.64230.02220.43470.70150.14270.63451.52380.14742.469
kNN 3.07421.901111.34786.44162.005543.49560.32050.02180.12450.65480.02220.45090.71410.14260.65231.5520.14662.5549
wkNN 3.07331.901211.34246.43932.005643.46610.32050.02180.12450.65470.02220.45080.71410.14260.65221.55180.14662.5542
Mean Imputation3.10241.898511.51986.50121.998144.25950.32240.02180.12570.6590.02220.45640.71810.14250.65781.56060.14652.5817
Fig. 1

Distribution of data for the variable X before removal of missing cases and after imputation with the kNN algorithm, setting k equal to 1, 3 or 10 neighbors

Fig. 2

Trade-off between inaccuracy of imputation and MSE of the standard deviation (SD) for the kNN algorithm in relation to the number of k neighbors (x-axis); normalized values are shown; variable of interest: X

Standard deviation of the mean for the imputed variables, average of 500 samples of n = 400, k = 5, 15 % or 30 % of missing data Distribution of data for the variable X before removal of missing cases and after imputation with the kNN algorithm, setting k equal to 1, 3 or 10 neighbors Trade-off between inaccuracy of imputation and MSE of the standard deviation (SD) for the kNN algorithm in relation to the number of k neighbors (x-axis); normalized values are shown; variable of interest: X The capability of the imputation frameworks was finally tested in a real-life dataset with 15 % of MCAR values in 4 variables of interests. As observed in synthetic datasets, the use of several neighbors or of complex learning schemas introduced a non-negligible distortion of data despite a good performance in imputing the correct value or in inferring the regression coefficients (Table 5). Overall, the trade-off between (in)accuracy in imputation and preservation of data was most favourable for few neighbors and when filtering via ReliefF was applied. In this specific setting, the use of 3 neighbors in conjunction with ReliefF 20 % or 30 % yielded the best performance.
Table 5

Performance of the different imputation algorithms in the SPECTF dataset with 15 % of cases with missing values (MCAR schema)

AVG RNKF5SF13SF16SF20S
FrameworkNNβInacc.SDRNKβInaccSDRNKβInaccSDRNKβInaccSDRNK
Plain 1NN 110.000550.102360.0764520.000480.257510.20554120.001880.100250.25897140.002080.086540.2226014
3NN 130.000370.086540.1419810.000260.248290.55387150.001130.092870.37754150.001060.093240.3860415
10NN 160.000440.093240.2979860.000240.259260.93204170.000910.092820.43864170.000850.136710.6772917
RReliefF10 1NN 120.001080.136710.10879170.000810.216630.1080480.002450.092520.2141390.002880.115350.160619
3NN 50.000560.115350.20818140.000380.186470.2524950.001080.081590.2695120.001270.109220.259772
10NN 100.000380.109220.36227160.000230.186320.4890490.000710.079240.3688670.000760.121700.482067
RReliefF20 1NN 40.000790.121700.09823100.000490.189380.0944340.001770.088180.1966740.001980.103740.153134
3NN 10.000380.103740.2030870.000230.164920.2466310.000830.079120.2991610.000830.101940.273931
10NN 60.000300.101940.34090130.000160.169340.4607760.000610.078270.3806660.000540.116050.495486
RReliefF30 1NN 30.000600.116050.0950780.000390.186640.1056630.001530.088970.2114750.001800.099290.160415
3NN 20.000330.099290.1893440.000190.165700.2628820.000840.080410.3129030.000770.099050.291943
10NN 70.000270.099050.3264790.000150.172190.4782770.000630.080110.3927780.000520.098650.510138
Bagging 1NN 170.000520.098650.36311120.000260.265731.06013180.001010.092730.45949180.000840.106790.7951118
3NN 200.000640.106790.43314180.000330.280171.20518200.001170.094120.48350200.001030.120340.9238120
10NN 210.000600.120340.53072210.000420.294741.32230210.001340.096480.52562210.001250.08574104.13821
Random 1NN 80.000230.085740.2711230.000130.177890.61912100.000630.079410.42192100.000410.087730.4788710
3NN 90.000250.087730.3132050.000130.183420.71124110.000630.080530.44411110.000400.094470.5542911
10NN 140.000260.094470.39598110.000140.195700.83654130.000640.082740.46791120.000420.100050.7013312
Bagging 1NN 150.000280.100050.43214150.000150.208180.90011140.000680.084640.47270130.000450.106360.7700013
+Random 3NN 180.000290.106360.47922190.000160.224591.02479160.000710.086580.49146160.000470.117500.8669516
10NN 190.000340.117500.52735200.000210.257491.19241190.000820.090740.51276190.000560.130350.9685419
Mean Imputation220.000600.130350.59451220.000640.303191.38048220.001930.100970.54523220.002440.235021.1184822
Performance of the different imputation algorithms in the SPECTF dataset with 15 % of cases with missing values (MCAR schema)

Discussion

In the present paper we explored the properties of NN imputation method under different learning frameworks to establish under what circumstances the imputation algorithm yields the best performance in terms of inference and capability to preserve the fundamental nature of data distribution. Overall we showed that: a) NN imputation may have a favourable effect on inferential statistics and that b) the precision of imputation is dependent from the degree of dependencies the variable with missing data has with other variables in the dataset and may be negligible for totally uncorrelated attributes; c) resampling methods do not offer any clear advantage with respect to conventional NN imputation methods; d) ReliefF selection algorithms (RReliefF for continous attributes) may help to wade through the noise and improve the performance of NN algorithms without causing any distortion in the data structure; e) the original data structure is only preserved with 1NN while for any value of k > 1, standard deviations are significantly affected and inflated; f) the use of a small number of k may represent a good compromise between performance and need to preserve the original distribution of data; g) in simulations with medium-sized datasets and a number of noisy variables, the best results in terms of both imputation accuracy and preservation of data distribution can be obtained using kNN with small k in conjunction with ReliefF filtering; h) the very same conclusions as in point g can be reached when a real-life dataset with 15 % of MCAR data is taken into account. The concept that methods capable of weighting the information provided by different variables may improve the performance of NN has previously been explored in [12] albeit with some substantial differences as compared to our approach. These authors used parametric measures based on the classical Pearson’s correlation to evaluate the distance between the variable to be imputed and the remaining attributes and then borrowed the necessary information from the nearest variables (or in their hybrid approach, by both the nearest variables and the nearest subjects). Compared to the selection by ReliefF family algorithms, this method considers only linear bivariate interactions and may not be the optimal choice when the data structure is completely unknown, underestimating the relevance of attributes in the multidimensional space. In our implementation, variables were selected on the basis of ranking and no weighting on the basis of the attributes’ scores was made. This approach proved effective even selecting a small fraction of variables (e.g. 10 %), yet at the moment we cannot provide any guidance about the optimal number of attributes to select. As previously shown in classification problems [17], this number is context-dependent, however the choice of the most parsimonious model seems reasonable as we do not expect many correlations among variables. Most notably, the selection of the top 20 % or 30 % ranking ReliefF attributes proved effective in obtaining the best performance in a medium-sized real-life dataset with unknown structure or dependencies among variables. One of the most striking findings of our simulation is that when kNN imputation is chosen, it is advisable to limit the number of k neighbors, because of the risk to severely impair the original variability of data. Again, we cannot provide an optimal number of neighbors to select, however both in the simulations we conducted and in the test in the real-life dataset we subsequently performed, a value of k = 3 seem a reasonable choice. These findings are of paramount importance because contrarily to the common notion derived from the work of Troyanskaya et al [11], a value f k ranging from 10 to 20 may not be appropriate unless data distortion is completely neglected and the accuracy of the imputed data (as measured for instance by the MRSE) is the sole outcome of interest. In our simulation, we considered a coloured form of noise and we did not just added a “white” Gaussian noise to the variables of interest, including among the irrelevant variables blocks of correlated attributes as well as unrelated attributes normally or diversely (e.g. non central chi-square) distributed and thus our findings are robust against different types of noise. Despite this, the limitations of a simulation setting should be acknowledged as real-life datasets are far more complex and challenging. Even the final test we conducted in the SPECTF heart dataset is not fully exhaustive of what a researcher may encounter in the real-life, as we considered only a MCAR mechanism to create the missing data. The conclusions we draw applies to cases with moderate sizes of missingness, no lower than 15 % and no higher than 30 %; we intentionally limited our evaluations to this range as for small amounts of missing data, under the MAR or MCAR mechanisms, imputation may be useless and for larger amounts caution should always be applied because estimates may become very imprecise [21]. Thus, despite the efficiency of NN imputation under these conditions, it should remembered that imputation should be carefully applied and cannot solve all the problems of incomplete data [22] and that NN imputation can have serious drawbacks as we showed for instance considering the risk of distorting data distribution or the lack of precision in imputing variables with no dependencies in a dataset or, conversely, the possibility to introduce spurious associations considering dependencies where they do not exist.

Conclusions

The use of ReliefF selection algorithms in conjunction with kNN imputation methods, provided that k are adequately low, gives and adequate trade-off between precision of imputation and capability to preserve the natural structure of data. The use of large number of k neighbors is only apparently useful in NN imputation problems as the gain in precision masks a striking distortion in the true distribution of data.

Abbreviations

MAR, missing at random; MCAR, missing completely at random; MSE, mean squared error; NN, nearest neighbour
  6 in total

1.  Knowledge discovery approach to automated cardiac SPECT diagnosis.

Authors:  L A Kurgan; K J Cios; R Tadeusiewicz; M Ogiela; L S Goodenday
Journal:  Artif Intell Med       Date:  2001-10       Impact factor: 5.326

2.  Missing value estimation methods for DNA microarrays.

Authors:  O Troyanskaya; M Cantor; G Sherlock; P Brown; T Hastie; R Tibshirani; D Botstein; R B Altman
Journal:  Bioinformatics       Date:  2001-06       Impact factor: 6.937

Review 3.  A critical look at methods for handling missing covariates in epidemiologic regression analyses.

Authors:  S Greenland; W D Finkle
Journal:  Am J Epidemiol       Date:  1995-12-15       Impact factor: 4.897

4.  A Review of Hot Deck Imputation for Survey Non-response.

Authors:  Rebecca R Andridge; Roderick J A Little
Journal:  Int Stat Rev       Date:  2010-04       Impact factor: 2.217

Review 5.  Applications of multiple imputation in medical studies: from AIDS to NHANES.

Authors:  J Barnard; X L Meng
Journal:  Stat Methods Med Res       Date:  1999-03       Impact factor: 3.021

6.  Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

Authors:  Serena G Liao; Yan Lin; Dongwan D Kang; Divay Chandra; Jessica Bon; Naftali Kaminski; Frank C Sciurba; George C Tseng
Journal:  BMC Bioinformatics       Date:  2014-11-05       Impact factor: 3.169

  6 in total
  78 in total

Review 1.  Insights into Computational Drug Repurposing for Neurodegenerative Disease.

Authors:  Manish D Paranjpe; Alice Taubes; Marina Sirota
Journal:  Trends Pharmacol Sci       Date:  2019-07-17       Impact factor: 14.819

2.  Small Noncoding RNA CjNC110 Influences Motility, Autoagglutination, AI-2 Localization, Hydrogen Peroxide Sensitivity, and Chicken Colonization in Campylobacter jejuni.

Authors:  Amanda J Kreuder; Brandon Ruddell; Kathy Mou; Alan Hassall; Qijing Zhang; Paul J Plummer
Journal:  Infect Immun       Date:  2020-06-22       Impact factor: 3.441

3.  Comprehensive anticancer drug response prediction based on a simple cell line-drug complex network model.

Authors:  Dong Wei; Chuanying Liu; Xiaoqi Zheng; Yushuang Li
Journal:  BMC Bioinformatics       Date:  2019-01-22       Impact factor: 3.169

4.  Multiple predictively equivalent risk models for handling missing data at time of prediction: With an application in severe hypoglycemia risk prediction for type 2 diabetes.

Authors:  Sisi Ma; Pamela J Schreiner; Elizabeth R Seaquist; Mehmet Ugurbil; Rachel Zmora; Lisa S Chow
Journal:  J Biomed Inform       Date:  2020-01-28       Impact factor: 6.317

5.  Artificial intelligence predicts disk re-herniation following lumbar microdiscectomy: development of the "RAD" risk profile.

Authors:  Garrett K Harada; Zakariah K Siyaji; G Michael Mallow; Alexander L Hornung; Fayyazul Hassan; Bryce A Basques; Haseeb A Mohammed; Arash J Sayari; Dino Samartzis; Howard S An
Journal:  Eur Spine J       Date:  2021-06-07       Impact factor: 3.134

6.  A Hybrid Approach for the Stratified Mark-Specific Proportional Hazards Model with Missing Covariates and Missing Marks, with Application to Vaccine Efficacy Trials.

Authors:  Yanqing Sun; Li Qi; Fei Heng; Peter B Gilbert
Journal:  J R Stat Soc Ser C Appl Stat       Date:  2020-05-22       Impact factor: 1.864

7.  Prepregnant Obesity of Mothers in a Multiethnic Cohort Is Associated with Cord Blood Metabolomic Changes in Offspring.

Authors:  Ryan J Schlueter; Fadhl M Al-Akwaa; Paula A Benny; Alexandra Gurary; Guoxiang Xie; Wei Jia; Shaw J Chun; Ingrid Chern; Lana X Garmire
Journal:  J Proteome Res       Date:  2020-02-27       Impact factor: 4.466

8.  Analysis of substance use and its outcomes by machine learning I. Childhood evaluation of liability to substance use disorder.

Authors:  Yankang Jing; Ziheng Hu; Peihao Fan; Ying Xue; Lirong Wang; Ralph E Tarter; Levent Kirisci; Junmei Wang; Michael Vanyukov; Xiang-Qun Xie
Journal:  Drug Alcohol Depend       Date:  2019-10-22       Impact factor: 4.492

9.  Sex Differences in Outcomes After STEMI: Effect Modification by Treatment Strategy and Age.

Authors:  Edina Cenko; Jinsung Yoon; Sasko Kedev; Goran Stankovic; Zorana Vasiljevic; Gordana Krljanac; Oliver Kalpak; Beatrice Ricci; Davor Milicic; Olivia Manfrini; Mihaela van der Schaar; Lina Badimon; Raffaele Bugiardini
Journal:  JAMA Intern Med       Date:  2018-05-01       Impact factor: 21.873

10.  Prediction of evening fatigue severity in outpatients receiving chemotherapy: less may be more.

Authors:  Kord M Kober; Ritu Roy; Anand Dhruva; Yvette P Conley; Raymond J Chan; Bruce Cooper; Adam Olshen; Christine Miaskowski
Journal:  Fatigue       Date:  2021-02-16
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.