Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data.

Adil Aslam Mir¹,², Kimberlee Jane Kearfott³, Fatih Vehbi Çelebi¹, Muhammad Rafique⁴.

Abstract

A new methodology, imputation by feature importance (IBFI), is studied that can be applied to any machine learning method to efficiently fill in any missing or irregularly sampled data. It applies to data missing completely at random (MCAR), missing not at random (MNAR), and missing at random (MAR). IBFI utilizes the feature importance and iteratively imputes missing values using any base learning algorithm. For this work, IBFI is tested on soil radon gas concentration (SRGC) data. XGBoost is used as the learning algorithm and missing data are simulated using R for different missingness scenarios. IBFI is based on the physically meaningful assumption that SRGC depends upon environmental parameters such as temperature and relative humidity. This assumption leads to a model obtained from the complete multivariate series where the controls are available by taking the attribute of interest as a response variable. IBFI is tested against other frequently used imputation methods, namely mean, median, mode, predictive mean matching (PMM), and hot-deck procedures. The performance of the different imputation methods was assessed using root mean squared error (RMSE), mean squared log error (MSLE), mean absolute percentage error (MAPE), percent bias (PB), and mean squared error (MSE) statistics. The imputation process requires more attention when multiple variables are missing in different samples, resulting in challenges to machine learning methods because some controls are missing. IBFI appears to have an advantage in such circumstances. For testing IBFI, Radon Time Series Data (RTS) has been used and data was collected from 1st March 2017 to the 11th of May 2018, including 4 seismic activities that have taken place during the data collection time.

Entities: Chemical

Substances:
Soil
Radon

Year: 2022 PMID: 35025953 PMCID: PMC8758196 DOI: 10.1371/journal.pone.0262131

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Radon (²²²Rn) gas is ubiquitous in the environment. It is found in air, water, and soil, and concentrates in the environment and buildings in a complex manner dependent upon geological, chemical, meteorological, and other temporally variant parameters [1-10]. While the bulk of knowledge about its adverse health effects has resulted from studies of lung cancer in uranium miners, radon health effects are an active area of epidemiological work involving indoor domestic radon gas concentrations [10-14]. Such work often involves indoor radon air concentration time series coupled with data about multiple environmental and geographic variables. Such data sets may be incomplete, resulting in the need to discard data or perform extrapolations using machine learning or other modeling methods.

While radon is of concern when found in hazardously high concentrations in occupied dwellings, it has been found to be beneficial in that it is potentially predictive of earthquakes [15-25]. Various studies show that anomalies in radon time series data offer strong evidence for earthquake prediction and forecasting [21, 26–30]. Decades of studies have specifically explored the linkages between SRGC and seismic activity [31]. Moreover, soil radon gas emission and transportation dynamics are influenced by various meteorological factors (such as temperature, rainfall, pressure, and relative humidity) which are unrelated to the seismological activity deeper in the Earth's crust that also influences the movement of radon gas in the environment [32, 33]. Multiple studies have been performed to analyze the correlation between SRGC and different meteorological parameters [7, 34–38]. A study conducted at Hokkaido University in Sapporo, Japan, to monitor soil radon gas concentration found that temperature was the dominant meteorological parameter affecting soil radon levels and variability [39]. Sahoo et al. [40] analyzed the influence of meteorological parameters on radon emission dynamics using linear regression analysis. It was observed that temperature is negatively correlated, whereas humidity and pressure are positively correlated, with the radon time series. That study also reports a considerable number of anomalies prior to the occurrence of local earthquakes of magnitude 3.7 and 4.2 in Badargadh, India. Different computationally intelligent methods have been proposed and successfully applied to predict radon concentration from environmental parameters such as pressure, rainfall, and air and soil temperature [27, 41–44]. Such predictions depend upon data sets which may often include missing information for radon or for some of the important environmental parameters that influence its concentration [45].

This paper concerns itself with a method for imputing, or filling in, missing data to improve the performance of machine learning approaches being considered for identifying seismic abnormalities from soil radon gas concentration (SRGC). Radon health effects and the use of radon as a precursor indication of earthquakes represent prime examples of the interaction of the atmosphere, lithosphere, and hydrosphere with human biology, influenced by their behavior and the built environment. Improving the data sets for analysis is the overall goal of this work. If missing data are not properly imputed, this may lead to unreliable outcomes. Within any time series, lost data may result from human error, instrument failure, or downtime due to routine maintenance [46]. Missing data can be classified by the mechanism through which the missing data are generated [47]. The choice of imputation method is influenced by the actual causes and characteristics of the missing data, whether due to data loss, perceived inapplicability, or lack of relationship to a given situation [48].
The nature of absent data, or missingness, can be classified in three ways [47, 49]. Data are said to be missing completely at random (MCAR) if the probability of being missing is the same for all cases, i.e. the missingness is not related to the data itself. When the tendency of a data point to be missing is related to the observed data, but not to the missing data, the data are called missing at random (MAR). Finally, for data missing not at random (MNAR), two possible reasons may occur: the missingness depends on the hypothetical value of the missing data point itself, or it is related to some other variables in the data.

To impute missing data, straightforward methods are typically used. Examples are complete or available case analysis, missing-indicator methods, and mean, median, and mode imputation. Unfortunately, these approaches may result in severely biased estimates and inefficient analyses [50, 51]. Multiple imputation is a more sophisticated approach to handling missing data that performs better than other conventional methods [52-55]. However, there are certain pitfalls in multiple imputation analyses [56]. When dealing with highly skewed data, multiple imputation can result in implausibly low or even negative values. In many scenarios an analysis needs to explore the association between an outcome and one or more predictor variables; when the outcome variable has missing values, it may be left out of the imputation procedure. Omitting the outcome variable would falsely weaken the association between the predictors and the outcome. Moreover, the multiple imputation procedure is computationally intensive: some algorithms run repeatedly for a better approximation, and the run time increases with the amount of missing data. Machine learning methods have been used to reconstruct incomplete and irregularly sampled experimental data for indoor radon gas concentrations [45].
A comparison of traditional statistical methods of data imputation with machine learning methods using available controls concluded that machine learning outperformed the statistical methods and increased prognosis accuracy significantly [57, 58]. Mital et al. [59] proposed a sequential imputation algorithm for the imputation of missing values in daily spatio-temporal precipitation time series records. The authors demonstrated that combining the proposed sequential imputation method with a spatial interpolation based on a Random Forest method has several benefits as the number of stations with incomplete records increases. However, the sequential imputation method does not add any extra spatial information when the number of stations with incomplete records decreases. Stochastic semi-parametric regression imputation was found to be superior to existing semi-parametric regression imputation for both simulated and real data [60]. An efficient imputation-based method was also proposed which uses an expectation-maximization (EM) algorithm for multivariate time series data under the assumption of a normal distribution [61].

In this study, a more robust methodology for data imputation, Imputation by Feature Importance (IBFI), is proposed and its performance is compared with the commonly applied mean, median, mode, hot-deck, and predictive mean matching statistical imputation methods. Actual SRGC data collected over a 14-month period, during which four seismic events occurred, are used for the study. Missing data were simulated using the R package “mice” [62] for 10, 20, and 30% of the data under MAR, MCAR, and MNAR scenarios. The XGBoost machine learning method was used as the base learner for this work. It is noted that IBFI can be used to impute complex missingness patterns and may be applied with any machine learning method, such as Random Forest and Naïve Bayes.

Materials and methods

Instrumentation and location

SRGC time series data were obtained on the fault line present in Muzaffarabad, a city in the Pakistani territory of Kashmir. The location of the soil radon measuring station is presented in Fig 1. A humidity-insensitive radon and thoron monitor (SARAD RTM 1688–2, Nuclear Instruments, Germany) recorded radon, thoron, temperature, humidity, and barometric pressure at latitude 34.39621 and longitude 73.47347. Readings were integrated over 40 min, resulting in 36 measurements every 24 h for more than 1 y. Additional details concerning the instrument and the resulting data are reported elsewhere [27, 63, 64]. The statistical details of the variables in soil radon gas concentration time series data are provided in

Fig 1

Soil radon measuring station, located within 150 km of the epicentre of the strongest earthquake since 1900, at latitude 34.39621 and longitude 73.47347.

The dataset consists of 15692 radon and thoron measurements along with the environmental parameters temperature (°C), relative humidity, and pressure (mbar). Radon concentration (RN) varied from 13743 Bq/m³ to 28085 Bq/m³. The mean and median of the radon time series were 21364 Bq/m³ and 21569 Bq/m³ respectively. The temperature varied from 4 to 42.5°C during the study period.

Missing data simulation and analysis plan

Fig 2 displays the complete simulation and analysis plan for the current study. The overall SRGC dataset consists of the measured attributes radon, thoron, temperature, relative humidity, and pressure. To analyze the imputation methods, three missingness mechanisms (MCAR, MNAR, and MAR) are introduced into the SRGC dataset, resulting in modified data sets with 10, 20, and 30% of the data missing. The missing values are introduced into the dataset artificially using the R package “mice” [62]. The core idea for introducing missing values in a multivariate dataset lies in the missing patterns, where a missing pattern is a particular combination of variables with missing values and variables with available values [65]. The missing patterns with their frequencies are shown in Fig 4. The complete dataset is divided randomly into k subsets according to the missing data patterns. The size of each subset depends upon the frequency vector, which gives the frequency with which each pattern occurs in the dataset. Whether a data row in a subset is a candidate for missingness depends upon the missingness mechanism (MCAR, MNAR, or MAR). In MCAR scenarios, all data rows in the subsets have an equal probability of being missing, while in MNAR and MAR scenarios weighted sum scores are computed. Put simply, a weighted sum score is the outcome of a linear regression equation. These scores provide the basis for deciding which candidate data rows become missing. Finally, the data rows in each subset are made incomplete according to the missing data pattern and their probability of being missing. After the missing values have been introduced, the subsets are merged into a single incomplete dataset with missing values in different data rows.
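The amputation step described above can be illustrated with a minimal Python sketch. It is not the “mice” implementation used in the study: the function name `ampute`, its arguments, the logistic transform of the standardised score, and the synthetic data are all assumptions made for illustration. A pattern is a list of 1 (keep) and 0 (make missing); under MCAR every row is equally likely to be chosen, while otherwise a weighted sum score drives the choice. In this sketch, placing weight only on variables that stay observed gives MAR-like behaviour, and placing weight on the variable being removed gives MNAR-like behaviour.

```python
import math
import random


def ampute(rows, pattern, proportion, mechanism="MCAR", weights=None, seed=50):
    """Return a copy of rows in which round(proportion * len(rows)) rows
    lose the variables flagged 0 in `pattern` (hypothetical helper)."""
    rng = random.Random(seed)
    n_rows = len(rows)
    n_missing = round(proportion * n_rows)
    if mechanism == "MCAR":
        # Every row has the same chance of becoming incomplete.
        chosen = set(rng.sample(range(n_rows), n_missing))
    else:
        # Weighted sum score: a linear combination of the row's values.
        scores = [sum(w * v for w, v in zip(weights, row)) for row in rows]
        mean = sum(scores) / n_rows
        sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n_rows) or 1.0
        # Assumed choice: logistic transform, so higher scores are more
        # likely to be selected.
        odds = [1 / (1 + math.exp(-(s - mean) / sd)) for s in scores]
        candidates = list(range(n_rows))
        chosen = set()
        while len(chosen) < n_missing:
            pick = rng.choices(candidates, weights=[odds[i] for i in candidates])[0]
            candidates.remove(pick)
            chosen.add(pick)
    return [
        [None if (i in chosen and keep == 0) else v for v, keep in zip(row, pattern)]
        for i, row in enumerate(rows)
    ]


# Synthetic stand-in rows: [radon, temperature, relative humidity].
gen = random.Random(1)
data = [[gen.gauss(21000, 2000), gen.gauss(25, 8), gen.gauss(60, 10)] for _ in range(200)]

mcar = ampute(data, pattern=[0, 1, 1], proportion=0.2)
mar = ampute(data, pattern=[0, 1, 1], proportion=0.2, mechanism="MAR", weights=[0, 1, 0])
```

Here radon is removed from 20% of the rows in both cases; in the MAR call the chance of removal depends only on temperature, which remains observed.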

Fig 2

Simulation plan of the study.

Fig 4

Proposed methodology to select the feature that needs to be imputed first in the imputation by feature importance (IBFI) method.

The resulting nine altered SRGC data sets are then treated with six different data imputation methods. These include IBFI and the more common mean, median, mode, predictive mean matching (PMM), and hot-deck imputation methods. Performance metrics computed following the application of each imputation method include root mean squared error (RMSE), mean squared error (MSE), root mean squared log error (RMSLE), mean absolute percentage error (MAPE), and percentage bias (PB). The performance of an imputation method depends on its ability to impute values that are close to the real values, as reflected by each of these metrics. Descriptions of both the performance metrics and the imputation methods are given below.

Performance measures

To assess the performance of the imputation models in imputing the missing values of radon, thoron, temperature, relative humidity, and pressure, the following five statistical parameters are computed: root mean squared error (RMSE), root mean squared log error (RMSLE), mean absolute percentage error (MAPE), mean squared error (MSE), and percentage bias (PB). In the expressions below, $y_t$ denotes the observed (actual) value, $\hat{y}_t$ the predicted (imputed) value, and $T$ the total number of values.

RMSE is a very frequently used performance evaluation measure for prediction models in many different areas, such as air pollution [66, 67]. It is sensitive to outliers [68] because each error affects RMSE in proportion to the size of the squared error, so a larger difference between the actual and predicted values has a disproportionately large effect. RMSE is the square root of the average of squared errors computed over a total number of values T, specifically:

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2}$$

The RMSLE is obtained from the log of the predicted and observed values, namely:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\log(\hat{y}_t + 1) - \log(y_t + 1)\right)^2}$$

The RMSLE is employed when it is desirable to avoid over-penalizing large differences between the predicted and observed values in cases where those values are very large numbers. While RMSE is sensitive to outliers and inflates the error term when they are present, RMSLE strongly scales down their impact. It should be noted that RMSLE penalizes underestimation of the observed values more severely than overestimation.

The MAPE is a frequently used statistical measure of how accurate a prediction system is, computed from:

$$\mathrm{MAPE} = \frac{100}{T}\sum_{t=1}^{T}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$

The principal advantage of expressing the MAPE as a percentage, as opposed to simply reporting the mean absolute error, is that it is easier for researchers to conceptualize. The weakness arising from the normalization is that the MAPE becomes undefined for datasets that contain values of 0.

The mean squared error (MSE) measures how close the predicted and observed values are, and is given by:

$$\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2$$

For each predicted value, the distance from the corresponding actual value is measured and then squared. Put simply, the metric is the average of the squared errors.

The average tendency of the predicted values to be smaller or larger than the actual values is captured by the PB performance metric, defined as:

$$\mathrm{PB} = 100 \times \frac{\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)}{\sum_{t=1}^{T} y_t}$$

A PB of 0 is the optimal value, and values of low magnitude indicate accurate model simulation. Larger positive and negative values indicate overestimation and underestimation bias, respectively.
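The five metrics described in this section can be sketched in a few lines of Python. This is an illustrative implementation, not the code used in the study: the function names are hypothetical, the RMSLE uses the common log(1 + x) convention (an assumption), and PB is taken as predicted minus observed so that positive values mean overestimation, as stated above.

```python
import math


def mse(obs, pred):
    # Average of the squared errors.
    return sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    return math.sqrt(mse(obs, pred))


def rmsle(obs, pred):
    # log1p(x) = log(1 + x); the "+ 1" is an assumed convention that keeps
    # the logarithm finite when a value is 0.
    total = sum((math.log1p(p) - math.log1p(o)) ** 2 for o, p in zip(obs, pred))
    return math.sqrt(total / len(obs))


def mape(obs, pred):
    # Undefined (ZeroDivisionError) when any observed value is 0.
    return 100 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def percent_bias(obs, pred):
    # Positive: overestimation on average; negative: underestimation.
    return 100 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)


observed = [100.0, 200.0]
imputed = [110.0, 190.0]
scores = {
    "RMSE": rmse(observed, imputed),
    "RMSLE": rmsle(observed, imputed),
    "MAPE": mape(observed, imputed),
    "MSE": mse(observed, imputed),
    "PB": percent_bias(observed, imputed),
}
```

In this toy example the two errors cancel, so PB is 0 even though RMSE is 10; this is why PB is read alongside the other metrics rather than alone.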

Mean, median, and mode imputation methods

The mean model for imputation is a method in which the mean of the observed cases (all non-missing values of the attribute of interest) of a given variable serves as the replacement for missing values in that variable. The simple-to-use mean model inherently reduces the variability in the data, resulting in underestimated standard deviation and variance estimates. Median imputation substitutes the middlemost number of the observed values when these are arranged in order. Mode imputation replaces missing data with the most frequently occurring value for that particular variable. SRGC data consist of attributes that are continuous, meaning that no two values will be exactly the same. For this reason, kernel density estimation is used for mode imputation to produce a continuous estimate of the probability density function. The point at which the probability density function reaches its maximum is taken as the mode. For kernel density estimation, the R package “stats” [69] has been used. The function “density()” with its default parameters is used to calculate the kernel density estimate. The underlying function “density.default()” first disperses the mass of the empirical distribution function over a regular grid of at least 512 points. The fast Fourier transform is then used to convolve this approximation with a discretized version of the kernel (e.g. Gaussian). Finally, the densities at the specified points are evaluated using linear approximation.
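A minimal Python sketch of the three single-value imputations follows. It is an illustration rather than the R code used in the study: the helper names are hypothetical, the kernel density is evaluated directly on a 512-point grid instead of through a fast Fourier transform, and the rule-of-thumb bandwidth is an assumption chosen to resemble common defaults.

```python
import math
import statistics


def kde_mode(values, grid_size=512):
    """Grid point at which a Gaussian kernel density estimate peaks."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Assumed rule-of-thumb bandwidth; falls back to sd if the IQR is 0.
    spread = min(sd, (q3 - q1) / 1.34) or sd
    bw = 0.9 * spread * n ** -0.2
    lo, hi = min(values) - 3 * bw, max(values) + 3 * bw
    grid = [lo + i * (hi - lo) / (grid_size - 1) for i in range(grid_size)]

    def density(x):
        # Unnormalised density: the constant factor does not move the peak.
        return sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)

    return max(grid, key=density)


def impute(column, method="mean"):
    """Replace None entries with the mean, median, or KDE mode of the rest."""
    observed = [v for v in column if v is not None]
    estimators = {"mean": statistics.mean, "median": statistics.median, "mode": kde_mode}
    fill = estimators[method](observed)
    return [fill if v is None else v for v in column]


# Hypothetical temperature readings (°C) with two gaps and one outlier.
temps = [20.1, 20.4, 20.5, 20.6, 20.9, 21.0, 35.0, None, None]
by_mean = impute(temps, "mean")
by_median = impute(temps, "median")
by_mode = impute(temps, "mode")
```

Because every gap in a column receives the same value, all three variants shrink the variance of the imputed series, which is the weakness noted above for the mean model.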

Hot deck imputation method

Hot-deck imputation imputes the missing data of one or more features for a non-respondent, called the recipient, by replacing each missing value with an observed value from a “similar” responding unit, called the donor. Hot-deck imputation is an old but popular method because it is simple in concept and suitable for missing at random (MAR) patterns. The basic principle is to locate an appropriate donor value from an available observed case that is comparable to the missing case in some regards [70]. The donor is similar to the recipient in the features observed in both cases. The random hot-deck method selects the donor at random from a set of possible donors called the donor pool. Other versions of the method use a single donor from which the values are taken, generally the “nearest neighbor” based on some metric; these are called deterministic hot-deck methods because no randomness is involved in the donor selection.

However, hot-deck imputation has certain limitations. For example, it requires donors that are good matches to the recipients with respect to the available covariate information. A single donor may also be chosen for several recipients, which leads to replication of values [71]. This replication causes several problems, and there is an inherent risk that many or even all of the missing values are imputed from a single donor. The hot-deck method does not take the correlation between variables into account when imputing values in different features. The imputation procedure is univariate and does not recognize the multivariate nature of the dependent variables. Because values are copied or borrowed from an available case, another problem that arises with hot-deck imputation is the addition of random noise when the value is quantitative. In this study, missing values were imputed by hot-deck imputation using the R package “VIM” [72].
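The two hot-deck variants described above can be sketched in Python as follows. This is not the “VIM” implementation used in the study: the function `hot_deck`, its arguments, and the toy rows are hypothetical, and the sketch assumes that only the target column is missing in each recipient row.

```python
import random


def hot_deck(rows, target, rng=None):
    """Fill row[target] when it is None by copying the value from a donor.

    rng=None : deterministic hot deck; the donor is the complete row nearest
               to the recipient (squared Euclidean distance on other columns).
    rng given: random hot deck; the donor is drawn from the donor pool.
    """
    donor_pool = [r for r in rows if None not in r]
    others = [j for j in range(len(rows[0])) if j != target]
    result = []
    for row in rows:
        if row[target] is None:
            if rng is None:
                donor = min(
                    donor_pool,
                    key=lambda d: sum((d[j] - row[j]) ** 2 for j in others),
                )
            else:
                donor = rng.choice(donor_pool)
            # Borrow the donor's observed value; nothing is modelled.
            row = row[:target] + [donor[target]] + row[target + 1:]
        result.append(row)
    return result


# Hypothetical rows: [radon, temperature, relative humidity].
rows = [
    [21000.0, 20.0, 60.0],
    [25000.0, 35.0, 40.0],
    [None, 21.0, 58.0],
]
nearest = hot_deck(rows, target=0)
randomised = hot_deck(rows, target=0, rng=random.Random(50))
```

With only two donors, the replication risk mentioned above is easy to see: every recipient that resembles the first row receives exactly the same radon value.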

Predictive mean matching (PMM) imputation method

The predictive mean matching (PMM) method was introduced long ago [73, 74], but its widespread practical application began only recently. PMM [73, 75] is considered a good method for the multiple imputation of missing data, particularly when quantitative features that are not normally distributed are involved. It is the state-of-the-art hot-deck multiple imputation method [76]. If the original feature values are skewed, the imputed values will also be skewed, and if the original feature is bounded by some upper and lower limit (e.g. 0 to 50), the imputed values will respect the same limits. The reason is that the imputed values are original values borrowed from individuals with observed data. The potential donees and donors, selected by either an automatic distance-aided or a nearest neighbor method, are matched in PMM by the closeness of their predicted means. For each donor case, the predicted value for the incomplete case is compared with the fitted value obtained from a regression model. In the classical PMM approach, a case is then drawn from the pool of k cases whose estimated values are nearest to the value predicted for the missing case, and the missing value is imputed with the observed value of that donor case.

Initially, PMM was limited in usage: only a single variable with missing data could be handled or, more broadly, its applications were limited to situations with monotonic missing data patterns. The PMM method has since been embedded in various software packages that employ multiple imputation approaches, referred to as sequential generalized regression (SGR), fully conditional specification (FCS), or multiple imputation by chained equations (MICE). The quality of the imputed values depends upon the availability of appropriate donor cases. In small datasets, imputation by predictive mean matching may not give promising results, as suitable donor cases might not be available.

In the current study, missing values were imputed through PMM using the R package “mice” [62] with the parameters m (the number of multiple imputations), maxit (the number of iterations), method, and seed set to 5, 500, pmm, and 50 respectively. An m of 5 (considered to be enough [75], and also the default value) generates five imputed datasets that differ only in the imputed values. In classification or regression problems, prediction models built upon these imputed datasets perform better when their predictions are aggregated. To help the imputation process reach convergence, a maximum of 500 iterations was chosen; as a rule of thumb, 20 to 30 or fewer iterations per imputation are generally used. A random seed value of 50 was chosen for reproducibility.
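The matching step of PMM can be illustrated with a short Python sketch. It is deliberately simplified and is not the “mice” algorithm: it uses one predictor, a plain least-squares fit, and a single imputed dataset, whereas full multiple imputation also draws the regression parameters at random and repeats the process m times. The function name `pmm_impute` and the toy data are hypothetical.

```python
import random


def pmm_impute(x, y, k=5, seed=50):
    """Impute None entries of y from predictor x by predictive mean matching."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    # Least-squares line y = a + b * x fitted on the observed cases.
    mean_x = sum(xi for xi, _ in observed) / len(observed)
    mean_y = sum(yi for _, yi in observed) / len(observed)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed) / sum(
        (xi - mean_x) ** 2 for xi, _ in observed
    )
    a = mean_y - b * mean_x
    imputed = []
    for xi, yi in zip(x, y):
        if yi is None:
            predicted = a + b * xi
            # Donor pool: the k observed cases whose fitted values lie
            # closest to the prediction for the incomplete case.
            pool = sorted(observed, key=lambda d: abs((a + b * d[0]) - predicted))[:k]
            # Borrow an observed value from a donor, not the prediction.
            yi = rng.choice(pool)[1]
        imputed.append(yi)
    return imputed


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 20.0, None, 40.0, 50.0, 60.0]
single_donor = pmm_impute(x, y, k=1)
pooled = pmm_impute(x, y, k=5)
```

Because the imputed value is always one of the observed values, it can never fall outside the observed range. In R, the settings reported above would correspond to a call of the form `mice(data, m = 5, maxit = 500, method = "pmm", seed = 50)`, where `data` stands for the incomplete data frame.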

Imputation by feature importance (IBFI) method

A pictorial representation of this new imputation method appears as Fig 3, while the coded algorithm itself appears in pseudo-code format below. The proposed method starts with the input data matrix (DM) which contains the different attributes (or quantities), specifically SRGC, thoron concentration, soil temperature, pressure, and humidity. DM contains different types of missingness (MNAR, MCAR, and MAR) of values of the attributes shown in Fig 2.

Fig 3

Proposed methodology to envelop base learning algorithm for imputation.

Pseudocode, as implemented, for the imputation by feature importance (IBFI) method.

Split
NAvector = indices(is.NA(
TrainVector =
Priority_Mat =
For all the indices where Priority_Mat contain values of TrainVector, Set Priority_Mat[indices] = 0
For all the indices where Priority_Mat do not contain values of TrainVector, Set Priority_Mat[indices] = 1
if(Priority_Mat[n,j] == 1)
location_vector[l_c] = j
l_c = l_c + 1
break
indice_to_train = NA_vect
max_value = max(location_vector)
max_i = indices where (location_vector == max_value)
max_i = max_i[1]
indice_to_train = NAvector[max_i]
Traindata =
Trainclass =
trainF = Concatenate Column(traindata, class = trainclass) // to concatenate response variable with predictors
ModelName = concatenate(indice_to_train, TrainVector)
Flag =
Fit a machine learning model i.e. BM
Testdata =
Val = predict(model, testdata)
print("Model Hit")
testdata =
val = predict(Model_List[[flag]], testdata)
k = k + 1
Flag = -1
Flag = 1
break
Return flag

In the very first step, the original DM is divided into two subsets: Pure Data (PD) and Impure Data (ID). PD contains those samples from the data matrix that have no missing values, while ID consists of those samples that have between one and a maximum of n missing values per sample. IBFI uses PD to fit a machine learning model for imputing the missing values in ID. During the iterations, predicted values are written to their respective locations as soon as they are imputed, progressively completing the missing patterns in the order given by the feature importance matrix (FIM). The FIM is computed using the R package “randomForest” [77]. Each feature in PD is taken in turn as the response feature with the others as predictors, and the importance of the other features for predicting that response is computed. The features are then arranged by importance value from highest to lowest.
The imputation of missing values for a given attribute is next executed on the ID. This is done by applying a machine learning model, trained on the non-missing attributes in pure data (PD), to predict the values of the missing attributes in impure data (ID). Consider the attributes F1, F2, …, Fn. If the missing value occurs in F1, the remaining attributes F2, …, Fn are used to train a machine learning model, and the fitted model is used to predict the missing values of F1. If the attribute F3 is missing, then the attributes F1, F2, F4, …, Fn are used for training and F3 is predicted from the fitted model. These predicted values serve in place of the missing values.

The complexity increases when a sample has more than one missing value and certain features or attributes exhibit strong dependencies. Suppose DM contains the five attributes F1, F2, F3, F4, F5 and missing values occur in F1 and F5 of some samples, as shown in Fig 5. Also assume that certain attributes have a strong correlation with other attributes in the DM. For example, suppose the feature importance vectors of F1 and F5, ordered from highest to lowest, are F5, F3, F4, F2 and F2, F4, F1, F3 respectively. Conventionally, F1 would be predicted from a model trained on F2, F3, F4, F5, and F5 from a model trained on F1, F2, F3, F4. The imputation becomes complicated for a machine learning model when both F1 and F5 are missing in a sample. In this scenario, F2, F3, and F4 are the only attributes available for training the model, because predicting the missing value of F5 needs an available value for F1 and vice versa. Moreover, to impute F1 and F5 efficiently, one must decide which of them is to be imputed first. From the importance vectors above, F5 is the most important attribute for predicting F1, whilst F2 is the most important attribute for predicting F5. Based on feature importance, F5 needs to be imputed first; once the value of F5 is available, it is used to predict the value of F1.

Initially, there may be missing values for multiple attributes. The best attribute to impute is selected first, and the missing values for that attribute are imputed. After this is completed, there is one less attribute with missing values. The best attribute to impute next is then determined, and its values are imputed. The process continues until no attributes contain missing values.
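The F1/F5 example can be traced with a small Python sketch of one possible selection rule. This is an interpretation of the description above, not the authors' implementation: for each missing feature it finds the rank of the most important predictor that is itself missing, and imputes first the feature for which that rank is lowest in importance (that is, whose strongest predictors are available). The function names and the tie-breaking by feature name are assumptions.

```python
def first_blocked_rank(feature, missing, importance):
    """1-based rank of the most important predictor of `feature` that is
    itself missing; len(ranking) + 1 if every predictor is available."""
    ranking = importance[feature]
    for rank, predictor in enumerate(ranking, start=1):
        if predictor in missing:
            return rank
    return len(ranking) + 1


def choose_feature_to_impute(missing, importance):
    # Pick the feature whose first missing predictor appears latest in its
    # importance ranking; ties are broken alphabetically (an assumption).
    return max(sorted(missing), key=lambda f: first_blocked_rank(f, missing, importance))


# Feature importance vectors from the example, highest to lowest.
importance = {
    "F1": ["F5", "F3", "F4", "F2"],
    "F5": ["F2", "F4", "F1", "F3"],
}

order = []
missing = {"F1", "F5"}
while missing:
    chosen = choose_feature_to_impute(missing, importance)
    order.append(chosen)
    missing = missing - {chosen}

print(order)  # ['F5', 'F1']
```

F1 is blocked at rank 1 because its most important predictor, F5, is missing, whereas F5 is blocked only at rank 3, so F5 is imputed first and its value then becomes available for predicting F1.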

Fig 5

Statistics for IBFI compared with other methods (mean, median, mode, PMM, Hotdeck) normalized to the average statistic for all methods averaged across different variables, showing RMSE, RMSLE, MAPE, and MSE.

The order in which attributes are selected for imputation is determined by the FIM, which is obtained by calculating the variable importance for each attribute of the data set and arranging the values in descending order. Because it is an iterative approach, IBFI requires a termination criterion. For this purpose a rejection threshold, expressed as a number of missing values per sample, is selected. As Fig 3 shows, multiple features can be missing at once in a sample (e.g. F1 and F4); the rejection threshold is the maximum number of missing values per sample that will be imputed. If the rejection threshold is 3, samples with more than 3 missing values are rejected, and all the other samples are imputed in their respective scans. With a rejection threshold of 3, during the first iteration all samples with 3 missing values are reduced to 2 missing values per sample, as shown in Fig 4. After the second iteration, samples with 2 missing values are reduced to 1 missing value per sample, and the remaining missing value is then imputed. The result is a fully imputed data set, apart from those samples whose missingness count lies above the rejection threshold.

Moreover, the proposed methodology reuses models to improve its efficiency, storing the models that are fitted during the iterations. The models are stored in such a way that if F1 is the dependent feature while F2 and F3 are the independent features, the model is stored in memory as M123. If, in a subsequent iteration (e.g. when a sample missing 3 features has been reduced to missing 2 features), F1 again needs to be predicted from F2 and F3, the stored model M123 is used to impute the value of F1 instead of fitting another model. The sequence in which missing features are imputed differs from sample to sample.
The whole procedure scans the impure dataset according to the rejection threshold. Consider data row 1, in which features F1 and F5 are missing, and data row 100, in which features F1, F2, and F5 are missing. Data row 1 has the values of F2, F3, and F4 available to predict F1 and F5, whilst data row 100 has only F3 and F4 available to predict F1, F2, and F5. During scan 1 of iteration 1, the feature importance matrix directs the algorithm to predict the missing value of F1 using a model fitted with F1 as the response and F2, F3, and F4 as predictor features. After the missing value of F1 has been predicted and imputed, the model is stored as M1234 for future patterns. In the same iteration, when data row 100 is scanned, the feature importance matrix directs the algorithm to impute the missing value of F2 by taking F2 as the response attribute and F3 and F4 as predictor attributes. After F2 has been predicted and imputed, data row 100 has a value of F2 available for predicting other features. Now consider the second iteration and the scan of data row 100: the feature importance matrix directs the algorithm to impute F1 first instead of F5. At this point F2, F3, and F4 are available to predict the missing value of F1. Instead of fitting another model, the previously generated M1234 model is reused, and the values of F2, F3, and F4 are passed to it to predict the missing value of F1. Reusing already fitted models to impute similar missing patterns enhances performance and reduces the time and space requirements.
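The rejection threshold and the model reuse described above can be sketched in Python as follows. The names `model_key`, `ModelStore`, `split_by_threshold`, and `fake_fit` are hypothetical, features are assumed to be named "F" followed by a number, and the base learner is replaced by a stand-in that only records that a fit took place, so the sketch shows the bookkeeping rather than the learning.

```python
def model_key(response, predictors):
    """Name a model after its response and predictors,
    e.g. F1 predicted from F2 and F3 becomes 'M123'."""
    return "M" + response[1:] + "".join(p[1:] for p in sorted(predictors))


class ModelStore:
    """Keeps fitted models so that a repeated pattern is not refitted."""

    def __init__(self, fit):
        self.fit = fit      # base learner: fit(response, predictors) -> model
        self.models = {}
        self.hits = 0

    def get(self, response, predictors):
        key = model_key(response, predictors)
        if key in self.models:
            self.hits += 1  # model hit: reuse the stored model
        else:
            self.models[key] = self.fit(response, predictors)
        return self.models[key]


def split_by_threshold(rows, threshold):
    """Rows with more than `threshold` missing values are rejected."""
    kept = [r for r in rows if sum(v is None for v in r) <= threshold]
    rejected = [r for r in rows if sum(v is None for v in r) > threshold]
    return kept, rejected


fit_calls = []


def fake_fit(response, predictors):
    # Stand-in for fitting a real learner such as XGBoost.
    fit_calls.append((response, tuple(predictors)))
    return "model for " + response


store = ModelStore(fake_fit)
store.get("F1", ["F2", "F3", "F4"])  # data row 1: fits and stores M1234
store.get("F2", ["F3", "F4"])        # data row 100, iteration 1: fits M234
store.get("F1", ["F2", "F3", "F4"])  # data row 100, iteration 2: reuses M1234

kept, rejected = split_by_threshold(
    [[None, 1.0, 2.0, 3.0, None], [None, None, 2.0, None, None]], threshold=3
)
```

Three imputation requests lead to only two model fits. With a real base learner the saving grows with the number of rows that share a missing pattern.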

Results and discussion

The MAPE and PB statistics for all methods tested are shown in Tables 2 and 3 for all variables, namely radon (RN), thoron (TH), temperature (TC), relative humidity (RH), and pressure (PR), for 20% MCAR, MNAR, and MAR data. All methods had a MAPE of at most 0.03% for PR, as expected because pressure variations are generally very small and regular in time. RN and TH had similar statistics for a given method, with IBFI performing best. As shown in these tables, for 20% missingness the MAPE of IBFI for RN ranged between 0.50% and 0.53%, with this statistic being up to 1.8 times higher for Hotdeck. IBFI is similarly superior to all other methods for imputing TC values. For TC and 20% MCAR, IBFI had a MAPE of 0.8% compared to 1.3–3.5% for other methods. For TC and 20% MNAR, IBFI had a MAPE of 0.5% compared to 0.8–2.9% for other methods. The 20% MAR data show similar results for TC, namely 1% for IBFI compared to 1.4–3.9% for other methods. When PB is considered, the absolute percent bias of IBFI at 20% missingness was among the lowest of the methods for most variables under MCAR, MNAR, and MAR, although it was not the lowest in every case (for example, Hotdeck had a smaller absolute PB for RN under MCAR, and PMM for TC under MNAR).
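MAPE and PB capture different aspects of imputation error, which a short Python sketch can make concrete. The definitions below are the commonly used ones and are an assumption here; the paper's exact formulas and sign convention for PB may differ. The function names and the numbers are illustrative only.

```python
from typing import Sequence


def mape(observed: Sequence[float], imputed: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    total = sum(abs((o - p) / o) for o, p in zip(observed, imputed))
    return 100.0 * total / len(observed)


def percent_bias(observed: Sequence[float], imputed: Sequence[float]) -> float:
    """Percent bias: positive when imputed values overestimate on average."""
    return 100.0 * sum(p - o for o, p in zip(observed, imputed)) / sum(observed)


# Made-up values: one overestimate and one underestimate of equal size.
observed = [100.0, 200.0, 400.0]
imputed = [110.0, 190.0, 400.0]

print(mape(observed, imputed))          # about 5 (percent)
print(percent_bias(observed, imputed))  # 0.0, because the errors cancel
```

Because errors of opposite sign cancel in PB but not in MAPE, a method can show a small PB while still having a sizeable MAPE, which is one reason to read the two statistics together.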

Table 2

MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MCAR and MNAR and all parameters tested (RN, TH, TC, RH, and PR).

Method	Statistics	MCAR 20%					MNAR 20%
Method	Statistics	RN	TH	TC	RH	PR	RN	TH	TC	RH	PR
IBFI	MAPE	0.53%	0.48%	0.83%	0.24%	0.01%	0.52%	0.46%	0.54%	0.18%	0.01%
IBFI	PB	-0.04%	-0.05%	-0.18%	-0.01%	0.00%	0.31%	0.21%	0.12%	0.07%	0.01%
Mean	MAPE	0.69%	0.72%	2.78%	0.74%	0.02%	0.67%	0.75%	1.71%	0.47%	0.02%
Mean	PB	-0.05%	-0.11%	-1.38%	-0.18%	0.00%	0.42%	0.42%	0.46%	0.34%	0.01%
Median	MAPE	0.69%	0.71%	2.84%	0.75%	0.02%	0.63%	0.77%	1.70%	0.37%	0.02%
Median	PB	-0.13%	-0.05%	-1.56%	-0.36%	0.00%	0.35%	0.47%	0.38%	0.18%	0.01%
Mode	MAPE	0.72%	1.07%	3.01%	0.75%	0.03%	0.86%	1.42%	2.91%	0.30%	0.03%
Mode	PB	0.17%	0.98%	2.61%	-0.18%	0.02%	0.73%	1.40%	2.71%	-0.14%	0.03%
PMM	MAPE	0.81%	0.71%	1.28%	0.33%	0.02%	0.77%	0.67%	0.82%	0.26%	0.02%
PMM	PB	-0.09%	-0.11%	-0.24%	-0.01%	0.00%	0.32%	0.21%	0.10%	0.08%	0.01%
Hotdeck	MAPE	0.95%	0.98%	3.47%	0.96%	0.03%	0.88%	0.94%	2.24%	0.64%	0.03%
Hotdeck	PB	-0.02%	-0.14%	-1.34%	-0.13%	0.00%	0.45%	0.42%	0.40%	0.34%	0.01%

Table 3

MAPE and PB statistics for IBFI compared with other imputation methods (mean, median, mode, PMM, and Hotdeck) for 20% missingness of type MAR and all parameters tested (RN, TH, TC, RH, and PR).

Method	Statistics	MAR 20%
Method	Statistics	RN	TH	TC	RH	PR
IBFI	MAPE	0.50%	0.48%	0.97%	0.23%	0.01%
IBFI	PB	0.01%	-0.03%	-0.16%	-0.01%	0.00%
Mean	MAPE	0.65%	0.74%	3.08%	0.63%	0.02%
Mean	PB	0.17%	-0.15%	-2.28%	-0.24%	-0.01%
Median	MAPE	0.63%	0.73%	3.14%	0.66%	0.02%
Median	PB	0.09%	-0.09%	-2.37%	-0.41%	-0.01%
Mode	MAPE	0.75%	0.71%	2.82%	0.93%	0.02%
Mode	PB	0.44%	0.10%	2.38%	-0.90%	0.01%
PMM	MAPE	0.78%	0.68%	1.39%	0.36%	0.02%
PMM	PB	-0.02%	-0.06%	-0.28%	-0.03%	0.00%
Hotdeck	MAPE	0.92%	1.00%	3.93%	0.83%	0.03%
Hotdeck	PB	0.19%	-0.14%	-2.34%	-0.26%	-0.01%

Fig 5 shows the results when each statistic is normalized to the same statistic averaged over all of the methods and then averaged across all variables. The averages are calculated for all variables across the different missingness scenarios (MCAR, MNAR, and MAR). Firstly, the average of each performance metric over all imputation methods is calculated for every variable in each missingness scenario and missingness percentage. Secondly, each value in a given scenario, such as MCAR 10%, is normalized by dividing it by the corresponding average calculated in the first step. This results in normalized values of each performance metric across the missingness scenarios and percentages. Thirdly, the normalized values of each metric are averaged over all variables for each imputation method. As can be observed in Fig 5, the RMSE, RMSLE, MAPE, and MSE statistics differ little as a function of the type and degree of data missingness, and IBFI is superior in all cases for these statistics. The percent bias (PB), as illustrated in Fig 6, appears to be somewhat dependent upon both the type and degree of missingness. For MCAR data, the PB for IBFI is similar to that of Hotdeck and PMM for 10% missingness, but superior for greater degrees of missingness (20%, 30%). With MNAR data, PB shows only positive bias; IBFI is similar to PMM for all degrees of missingness but better than the other methods. When missingness is MAR, mixtures of negative and positive PB are observed. IBFI has a similar PB to Hotdeck and Mean at 10% MAR, and to PMM at 10% and 20% MAR. For Mode >10% MAR missingness, IBFI shows lower PB than all methods.
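The three normalization steps can be sketched in Python for a single metric and scenario. The input values are the MAPE figures for 20% MCAR copied from Table 2; the variable names and the restriction to one metric and one scenario are simplifications assumed here, so the result is not meant to reproduce Fig 5 exactly.

```python
from statistics import mean

# MAPE (%) for 20% MCAR from Table 2; columns are RN, TH, TC, RH, PR.
mape_mcar20 = {
    "IBFI":    [0.53, 0.48, 0.83, 0.24, 0.01],
    "Mean":    [0.69, 0.72, 2.78, 0.74, 0.02],
    "Median":  [0.69, 0.71, 2.84, 0.75, 0.02],
    "Mode":    [0.72, 1.07, 3.01, 0.75, 0.03],
    "PMM":     [0.81, 0.71, 1.28, 0.33, 0.02],
    "Hotdeck": [0.95, 0.98, 3.47, 0.96, 0.03],
}
n_variables = 5

# Step 1: average the metric over all imputation methods, per variable.
column_mean = [
    mean(values[j] for values in mape_mcar20.values())
    for j in range(n_variables)
]

# Step 2: divide each value by the corresponding average from step 1.
normalized = {
    method: [value / column_mean[j] for j, value in enumerate(values)]
    for method, values in mape_mcar20.items()
}

# Step 3: average the normalized values over all variables, per method.
score = {method: mean(values) for method, values in normalized.items()}

for method, value in sorted(score.items(), key=lambda item: item[1]):
    print(f"{method:8s} {value:.2f}")
```

A score below 1 means the method is better than the average of all methods for this metric; by construction the scores of all methods average to 1.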

Fig 6

The complete statistical results (RMSE, RMSLE, MAPE, MSE, PB) for IBFI compared with other methods (mean, median, mode, PMM, Hotdeck) for different types (MCAR, MNAR, MAR) and degrees (10%, 20%, 30%) of data missingness are provided in the supplementary materials. Fig 7A–7C shows the reuse of previously fitted models in subsequent scans for data missing completely at random, with missingness percentages ranging from 10 to 30 percent. The X-axis shows the sample number in subsequent iterations, while the model hit rate and model creation are shown on the Y-axis. The proposed methodology achieves model reusability by keeping the models fitted during the iterations for future patterns. Fig 7A shows that fitted models are stored as samples are processed and are then reused for subsequent samples, as represented by the black dotted line. Fig 7A–7C shows that the model hit rate, drawn as a black dotted line, increases rapidly as more samples are processed. A model hit reflects the reuse of a previously fitted model. Before a model is created to impute a value, the model list is searched for the formulation currently directed by the feature importance matrix. If it is found, the already fitted model is used to predict the missing value at this stage. If no model hit occurs, a new model is fitted for the current formulation and added to the model list, where it may be used for later formulations. The points where a new model is created are shown as blue bubbles. As shown, the model hit rate of the proposed methodology rises as more samples are processed. Model creation is observed only during the first few measurements; thereafter, those models are used to predict the missing values in upcoming samples.
Reusing models in this way helps the proposed methodology to impute the missing patterns asymptotically better in terms of time and space. The same pattern was observed in Fig 7D–7I, which shows the fitted models' reusability statistics during the subsequent iterations for data missing not at random and missing at random. Model creation was observed only at the start of the imputation process, as shown by the red bubbles, while the hit rate of these fitted models increases as more measurements are processed. This reusability results in the efficient imputation of missing values for all the missingness scenarios and makes the methodology asymptotically better.
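The lookup described here, search the model list first and fit only on a miss, leads naturally to a running hit rate. The Python sketch below assumes one plausible definition, the cumulative fraction of model requests served by an already stored model; the paper does not state its definition in this section, and the function name and request sequence are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def model_hit_rate(requests: Iterable[str]) -> Tuple[List[float], List[int]]:
    """Return the cumulative hit rate after each request and the
    (1-based) request numbers at which a new model had to be created."""
    stored: Set[str] = set()
    hits = 0
    rates: List[float] = []
    creations: List[int] = []
    for n, name in enumerate(requests, start=1):
        if name in stored:
            hits += 1             # model hit: reuse the stored model
        else:
            stored.add(name)      # model miss: fit and store a new model
            creations.append(n)
        rates.append(hits / n)
    return rates, creations


# Hypothetical sequence of model formulations requested during imputation.
requests = ["M1234", "M234", "M51234", "M1234", "M51234", "M1234", "M234", "M1234"]
rates, creations = model_hit_rate(requests)
print(creations)
print([round(r, 3) for r in rates])
```

Under this definition, model creation is confined to the first requests while the hit rate climbs as the same formulations recur.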

Fig 7

Previously fitted model reusability in subsequent scans with missingness percentages ranging from 10 to 30 percent, for data missing a/b/c) completely at random, d/e/f) not at random, and g/h/i) at random.

Conclusion

Real time series often contain missing values, and missingness can arise for many possible reasons. The situation becomes very important when missingness induces bias in the forecasting model. In this article, a methodology has been proposed that utilizes feature importance and iteratively imputes the missing values in time series data by incorporating any machine learning model, e.g. XGBoost. The proposed methodology imputes various complex patterns of missingness and applies a rejection threshold that automatically rejects those samples whose number of missing values exceeds it. Missing value patterns have been simulated at missingness percentages ranging from 10 to 30 percent under missing completely at random (MCAR), missing not at random (MNAR), and missing at random (MAR) scenarios. In this way, artificially missing value patterns have been introduced in different features. When imputing the same incomplete data, the proposed methodology outperforms other frequently used methods such as mean, median, mode, predictive mean matching, and hot-deck imputation. Different statistical parameters, viz. RMSE, RMSLE, MAPE, and MSE, have been calculated and indicate that the proposed methodology yields much lower error values than the other imputation methods under the MCAR, MNAR, and MAR scenarios at missingness percentages of 10, 20, and 30 percent. The findings of the study show that the efficiency of the proposed methodology lies in the selection of the best predictor variables for different missingness patterns and the utilization of previously fitted models. The runtime decision of choosing the best available predictor variables for different response variables results in the efficient development of machine learning models for imputing the values.
As far as future directions are concerned, the application of the proposed methodology to other fields of research, such as electric load forecasting and medical databases, may be of interest. Imputation by feature importance (IBFI) can be extended to add class information while imputing supervised classification datasets.

Summary of the results of the simulations at missing completely at random (MCAR) with the missingness percentage of a) 10%, b) 20%, and c) 30%. (DOCX)

Summary of the results of the simulations at missing not at random (MNAR) with the missingness percentage of a) 10%, b) 20%, and c) 30%. (DOCX)

Summary of the results of the simulations at missing at random (MAR) with the missingness percentage of a) 10%, b) 20%, and c) 30%. (DOCX)

13 Jul 2021 PONE-D-21-18941 Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data PLOS ONE Dear Dr. Rafique, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Aug 27 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s).
You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Shamsuddin Shahid Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Please provide the full raw data set and any relevant code as supplemental files. 3. 
Thank you for stating the following financial disclosure: "Muhammad Rafique Grant No: 6453/AJK/NRPU/R&D/HEC/2016 under NRPU scheme to principal investigator MR. www.hec.gov.pk The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript". We note that one or more of the authors is affiliated with the funding organization, indicating the funder may have had some role in the design, data collection, analysis or preparation of your manuscript for publication; in other words, the funder played an indirect role through the participation of the co-authors. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please do the following: a. Review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. These amendments should be made in the online form. b. Confirm in your cover letter that you agree with the following statement, and we will change the online submission form on your behalf: “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.” Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions.
Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: No ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes Reviewer #3: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: No Reviewer #2: Yes Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: This study presents a new methodology to impute or gap-fill missing data. 
The methodology broadly leverages the strength of correlations among sampled variables. It seems to be generic enough and can be utilized with any learning algorithm. Although the proposed methodology seems to be free from any technical flaws, I could not follow all aspects of the work. In particular, the model reusability component of the methodology is not clear to me and can benefit with more clarification. I also have a few more comments that are geared towards clarifying some aspects of the presentation. Therefore, I recommend that the manuscript be subject to moderate to major revisions. There are also places where the language of the manuscript includes errors. I have marked a few lines in the comments which especially caught the eye. I recommend the authors do a thorough reading of the manuscript before resubmitting. Detailed comments: 1. The abstract and introduction talk about the importance of soil radon gas concentration (SRGC). Following this, the expectation is that the work will focus on imputation of SRGC data. However, the methodology is tested on imputation of five different variables: Radon, Thoron, Temperature, Relative Humidity, and Pressure. Therefore, the introduction should be revised or expanded to motivate the importance of all five variables considered in the study, and not just radon. 2. Introduction, line 68: “with the exact mechanism being unimportant for classification”. It is not clear what “classification” is being referred to here. 3. Introduction, line 72: “The nature of absent data, or missingness, can be classified in three ways”. I suggest to cite some references for this statement. E.g., works by Rubin [1] and Buuren [2]. 4. Introduction, lines 79-93: The authors review several imputation approaches. However, the review lacks completeness. First, it would be useful if the various approaches that are reviewed by the authors can be explained in a sentence or two. 
Second, while the cons of simple approaches are clearly outlined, other approaches are not adequately discussed. For instance, the authors mention the benefits of multiple imputation and stochastic regression methods, but their shortcomings are not mentioned. Third, the authors mention that machine learning tends to outperform traditional statistical methods. Which of the methods reviewed in the literature fall under the realm of traditional statistical methods? Finally, given the methodology utilizes correlations among predictor variables to model a response variable, it is recommended to briefly review some relevant recent work. For instance, Mital et al. [3] and Sahu et al. [4] investigated the impact of selecting highly correlated input features for modeling/imputing a response variable. 5. Materials and Methods; Instrumentation and location: I suggest that the authors provide a figure showing the location of the data used in this study. While it is not strictly necessary, it helps make the presentation more complete. 6. Materials and Methods, lines 128-129: “The missing values are introduced into the dataset artificially by the R package entitled mice”. I recommend that a brief description of how the missing values were inserted should be provided. Simply stating that the missing values were inserted using a package sounds opaque and is not sufficient. How does the package insert values that are consistent with the three different missingness patterns? What are the mechanics of inserting those missing values? 7. Material and Methods, lines 189-191: For kernel density estimation, what kernel is used? Is it the normal kernel or something else? Please clarify in the text. 8. Hot deck imputation method, lines 194-202: The description of the method is loaded with jargon that may not make much sense to a reader unfamiliar with this method. For instance, what is a “responding unit”? What is a “practical response”? 
I suggest that the description be re-worded and made more accessible. 9. Predictive mean matching imputation method, lines 204-216: Again, the method description is not clear. The context of sentences on lines 210-213 is not clear to an uninitiated reader (such as myself). Furthermore, on lines 215-216, the authors list the parameters used in the method. It is not clear what these parameters mean and how these values were chosen. 10. Pseudo code, line 226: The current presentation of the pseudo code seems very complex and could benefit with simplification. In particular, it seems to use the syntax and functions used in programming language R. I recommend that the code be revised to make it more readable for someone who is not familiar with R. 11. Line 249: What is a “sample” and a “value” in the context of this work? It seems that one sample refers to a measurement of all five attributes (values). The terminology should be clarified and made consistent across the manuscript. For instance, the terms “values” and “attributes” have been used interchangeably. 12. Lines 268-270: Please rephrase to correct the grammar. 13. Lines 291-303: Please rephrase to correct the grammar. 14. Lines 296-303: “The proposed methodology uses model reusability”. The description of model reusability is not clear to me. Specifically, in the scenario described in lines 298-303, it is not clear to me why F1 needs to be trained again using F2 and F3 during subsequent iterations. 15. Concerning feature importance: Is feature or variable importance quantified using correlations? If so, please clarify. 16. Variables RN, TH, TC, RH and PR: Please define these abbreviations. 17. Fig 4: Overall, I really like this figure. It helps the reader to understand all the results qualitatively. However, I did not follow how the normalization was done quantitatively. Please clarify by either rephrasing or perhaps giving an example of normalization. 
There is also one minor typo in the y-axis labels (“Mean: R 18. Fig 6: I did not understand the results in this figure since the concept of model reusability was not clear to me (see comment 14 above). Furthermore, the terms “model hit rate” and “model creation” have not been defined. 19. Lines 364-368: Please rephrase to correct the grammar. 20: Concerning “rejection threshold” or “rejection count”: How do the authors pick an appropriate value of the rejection threshold? Why did the authors pick a value of 3? If the number of attributes for a sample is 5, I would assume that a value of 4 may also work. Also, I would recommend keeping the terminology consistent to avoid ambiguity, i.e., use either the phrase “rejection threshold” or “rejection count”. 21: Concerning “Keywords”: I suggest the authors revise the keywords for the manuscript. The “Naïve Bayes” classifier and “Random Forests” are not used in this work and should not be used as keywords. References: 1. Rubin DB. Inference and missing data. Biometrika. 1976;63: 581–592. 2. Buuren S van. Flexible imputation of missing data. Second edition. Boca Raton: CRC Press, Taylor & Francis Group; 2018. 3. Mital U, Dwivedi D, Brown JB, Faybishenko B, Painter SL, Steefel CI. Sequential Imputation of Missing Spatio-Temporal Precipitation Data Using Random Forests. Front Water. 2020;2: 20. doi:10.3389/frwa.2020.00020 4. Sahu RK, Müller J, Park J, Varadharajan C, Arora B, Faybishenko B, et al. Impact of Input Feature Selection on Groundwater Level Prediction From a Multi-Layer Perceptron Neural Network. Front Water. 2020;2: 573034. doi:10.3389/frwa.2020.573034 Reviewer #2: Abstract: The authors need to revise it to highlight the problem, findings and novelty of their work. Introduction: The literature review in this section needs to be updated with recent published works relevant to the study. 
The authors should improve the problem statement and highlight the objectives of the study clearly and mention the main contribution in the study. Material and Methods: More explanation about the importance of the proposed model to solve the current problem. Results and discussion: The authors are encouraged to add more explanations to the findings and justify it clearly. Conclusion: The authors are advised to re-write this section to justify the findings and suggest future work to be carried in terms of deploying the models or proposing way to enhance it. References: The authors missed out recent references related to their works which they are encouraged to included in their revised version Reviewer #3: Review Report of Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data Although the paper may present a new methodology, it is badly written. My recommendation is “major revision”. The comments: The author mentioned in the abstract, introduction, and conclusion that they have used XGBoost, however; the XGBoost was not mentioned once in the methodology section. Yes, it is there in Figure 2 but it was not found in the text. Thus, it is not clear the role that XGBoost played in the proposed framework. I don’t understand why the authors employed several statistical metrics which measure the same characteristics. For example, RMSE and MSE, and RMSLE are somehow the same. In a matter of fact, the author showed that RMSE and MSE have several disadvantages. The authors should use only one of them like RMSLE and remove the remaining. The others make the paper longer for no added information. The same goes for MAPE and PB, I encourage the authors to pick only one of them. This will reduce the paper length and will give more focus to the new framework. Also, I feel it is not fair to compare the new framework to conventional data filling techniques. 
Obviously, the new framework will be better. I encourage the authors to add a random forest or support vector machine model to the comparison which may increase the work strength. A brief description of the data should be given. Yes, it may be described elsewhere. I think a very brief statistical description of the data itself is also required here. The introduction is badly written. It doesn’t follow a line of ideas. Many ideas are repeated here and there in the introduction section. Please, review it once more. ********** 6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: Yes: Mohamed Salem Nashwan [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
27 Aug 2021 Dear Editor and Editorial Staff, PLOS ONE We are pleased to inform you that we have revised the manuscript in the light of reviewers’ comments. The anonymous reviewers’ recommendations were extremely useful and we have addressed all of their recommendations in the revised manuscript. Please see below for responses to each individual comment. Reviewer #1: Comment The abstract and introduction talk about the importance of soil radon gas concentration (SRGC). Following this, the expectation is that the work will focus on imputation of SRGC data. However, the methodology is tested on imputation of five different variables: Radon, Thoron, Temperature, Relative Humidity, and Pressure. Therefore, the introduction should be revised or expanded to motivate the importance of all five variables considered in the study, and not just radon. Response We have incorporated necessary changes in introduction in order to address reviewers concern. As meteorological parameters have great influence on radon emission dynamics that can be used as a precursor for earthquake, reference material is added in the introduction to highlight the importance of variables other than radon along with the methodology to impute all these variables. Comment Introduction, line 68: “with the exact mechanism being unimportant for classification”. It is not clear what “classification” is being referred to here. Response We have revised the line as pointed out by respected reviewer. The term “classification” here refers to the categories of missingness such as missing completely at random, missing not at random and missing at random. The incomplete or missing data can be classified according to the mechanism through which it generates such as MCAR if missing value is generated by human or machine error. 
If the cause of missingness is related to observed data but not the missing data, it is classified as MAR while if it is missing because of other observed variables or some hypothetical value, it is MNAR. Comment Introduction, line 72: “The nature of absent data, or missingness, can be classified in three ways”. I suggest citing some references for this statement. E.g., works by Rubin [1] and Buuren [2]. Response The references provided by the respected reviewers have been added in the manuscript. Comment Introduction, lines 79-93: The authors review several imputation approaches. However, the review lacks completeness. First, it would be useful if the various approaches that are reviewed by the authors can be explained in a sentence or two. Second, while the cons of simple approaches are clearly outlined, other approaches are not adequately discussed. For instance, the authors mention the benefits of multiple imputation and stochastic regression methods, but their shortcomings are not mentioned. Third, the authors mention that machine learning tends to outperform traditional statistical methods. Which of the methods reviewed in the literature fall under the realm of traditional statistical methods? Finally, given the methodology utilizes correlations among predictor variables to model a response variable, it is recommended to briefly review some relevant recent work. For instance, Mital et al. [3] and Sahu et al. [4] investigated the impact of selecting highly correlated input features for modeling/imputing a response variable. Response Agreed. In addition to the simple approaches towards imputation tasks, the discussions about other imputation methods such as multiple imputations have been provided. The shortcomings of imputation methods are also provided in the manuscript to address the reviewer concern. 
We have also reviewed and added material regarding several machine learning methods for imputing missing data such as sequential imputation using Random Forest for imputing missing values in spatio-temporally daily time series precipitation records. Comment Materials and Methods; Instrumentation and location: I suggest that the authors provide a figure showing the location of the data used in this study. While it is not strictly necessary, it helps make the presentation more complete. Response Agreed. We have added the figure regarding instrumentation and location in the manuscript with caption “Fig 1. Soil radon measuring station located inside 150 km from the epicenter of the strongest earthquake since 1900 with the latitude, longitude of 34.396210 and 73.473470 respectively”. Comment Materials and Methods, lines 128-129: “The missing values are introduced into the dataset artificially by the R package entitled mice”. I recommend that a brief description of how the missing values were inserted should be provided. Simply stating that the missing values were inserted using a package sounds opaque and is not sufficient. How does the package insert values that are consistent with the three different missingness patterns? What are the mechanics of inserting those missing values? Response In order to address the reviewer concern, we have incorporated detailed mechanics of inserting missing values in the dataset for three different missingness scenarios such as MCAR, MAR and MNAR. The core idea to introduce the missing values, using R package “mice”, in the multivariate dataset lies in the missing patterns which are the mixture of variables with missing values and variables with available values. The missing patterns with its frequency are shown in Fig 3. The complete dataset is divided into k subsets randomly based upon k missing data patterns. 
The size of each subset is set by the frequency vector, which gives the relative frequency with which each missing data pattern should occur in the complete dataset. Whether a data row in a subset becomes a candidate for missingness depends on the missingness mechanism (MCAR, MAR or MNAR). In the MCAR scenario, all data rows in a subset have an equal probability of being made missing, whereas in the MAR and MNAR scenarios so-called weighted sum scores are computed. Put simply, a weighted sum score is the outcome of a linear regression equation, and these scores determine which data rows are candidates for missingness. Finally, the data rows in each subset are made incomplete according to the missing data pattern and its associated probability of being missing. After the missing values have been introduced, the subsets are merged to form an incomplete dataset with missing values in different data rows. Comment Material and Methods, lines 189-191: For kernel density estimation, what kernel is used? Is it the normal kernel or something else? Please clarify in the text. Response As the data used in this study are continuous time series data, kernel density estimation is used to produce a continuous estimate of the probability density function. The point at which that function reaches its maximum is taken as the mode. For kernel density estimation, the R package “stats” [69] is used. The function “density()” is called with its default parameters, so the kernel is Gaussian. The underlying method “density.default()” first disperses the mass of the empirical distribution function over a regular grid of at least 512 points. The fast Fourier transform is then used to convolve this approximation with a discretized version of the kernel. Finally, the density at the specified points is evaluated by linear approximation.
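The mode-by-kernel-density idea described in the response above can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code: it evaluates a Gaussian kernel density directly on a 512-point grid instead of using the binned FFT approach of “density.default()”, and the bandwidth rule only follows the style of R's default rule of thumb. The function names `silverman_bandwidth`, `kde_mode` and `impute_with_mode`, and the example values, are hypothetical.

```python
import math
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth in the style of R's default (an assumption here)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Use the smaller of the two spread estimates, falling back to sd if IQR is 0.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def kde_mode(values, grid_size=512, cut=3.0):
    """Return the grid point where a Gaussian kernel density estimate is largest."""
    bw = silverman_bandwidth(values)
    lo = min(values) - cut * bw
    hi = max(values) + cut * bw
    step = (hi - lo) / (grid_size - 1)
    norm = 1.0 / (len(values) * bw * math.sqrt(2.0 * math.pi))
    best_x, best_density = lo, -1.0
    for i in range(grid_size):
        x = lo + i * step
        # Direct summation of Gaussian kernels (no FFT, unlike density.default()).
        density = norm * sum(math.exp(-0.5 * ((x - v) / bw) ** 2) for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def impute_with_mode(series):
    """Replace None entries with the KDE mode of the observed values."""
    observed = [v for v in series if v is not None]
    mode = kde_mode(observed)
    return [mode if v is None else v for v in series]


# Illustrative temperature-like values with two gaps (made-up numbers).
series = [20.0, 21.0, 21.5, None, 22.0, 21.2, 35.0, None, 20.8]
filled = impute_with_mode(series)
```

Because the mode follows the densest cluster of observations, the single outlying value in the example has little influence on the imputed value, which is the usual reason for preferring a density-based mode over the mean for skewed continuous data.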
Comment Hot deck imputation method, lines 194-202: The description of the method is loaded with jargon that may not make much sense to a reader unfamiliar with this method. For instance, what is a “responding unit”? What is a “practical response”? I suggest that the description be re-worded and made more accessible. Response Agreed. We have incorporated the necessary changes in the manuscript to address the reviewer’s concern. We have reworded the text to make it more accessible and easier to understand, even for an unfamiliar reader. Comment Predictive mean matching imputation method, lines 204-216: Again, the method description is not clear. The context of sentences on lines 210-213 is not clear to an uninitiated reader (such as myself). Furthermore, on lines 215-216, the authors list the parameters used in the method. It is not clear what these parameters mean and how these values were chosen. Response Agreed. We have incorporated the necessary changes in the manuscript to address the reviewer’s concern. We have reworded the text to make it more accessible and easier to understand. Comment Pseudo code, line 226: The current presentation of the pseudo code seems very complex and could benefit with simplification. In particular, it seems to use the syntax and functions used in programming language R. I recommend that the code be revised to make it more readable for someone who is not familiar with R. Response Agreed. We have rewritten the pseudo code in a more algorithmic style so that it is easy to interpret for all audiences, even those unfamiliar with the syntax of the R programming language. Comment Line 249: What is a “sample” and a “value” in the context of this work? It seems that one sample refers to a measurement of all five attributes (values). The terminology should be clarified and made consistent across the manuscript. For instance, the terms “values” and “attributes” have been used interchangeably. Response Agreed with the respected reviewer.
A sample is one measurement of all five attributes (variables), while a value is the single measurement of one attribute; for example, temperature has the value 38.5 °C. To address the reviewer’s concern, we have made the terms consistent throughout the manuscript. Comment Lines 268-270: Please rephrase to correct the grammar. Response We have incorporated the necessary changes in the manuscript to address the reviewer’s concern. Comment Lines 291-303: Please rephrase to correct the grammar. Response We have incorporated the necessary changes in the manuscript to address the reviewer’s concern. Comment Lines 296-303: “The proposed methodology uses model reusability”. The description of model reusability is not clear to me. Specifically, in the scenario described in lines 298-303, it is not clear to me why F1 needs to be trained again using F2 and F3 during subsequent iterations. Response Because the proposed methodology imputes the attributes using a feature importance matrix, the order in which values are imputed differs from sample to sample. The whole procedure scans the impure dataset subject to the rejection threshold. Consider sample 1, whose missing pattern has missing values for attributes F1 and F5, and sample 100, which has missing values for attributes F1, F2 and F5. For sample 1, the values of F2, F3 and F4 are available to predict F1 and F5, while for sample 100 only F3 and F4 are available to predict F1, F2 and F5. During scan 1 and iteration 1, the feature importance matrix directs the algorithm to predict the value of F1 from F2, F3 and F4, taking F1 as the response attribute and F2, F3 and F4 as predictor attributes. After the value of F1 has been predicted and imputed, the model is stored as M1234 for future patterns.
During the same scan, on reaching sample no. 100, the feature importance matrix directs the algorithm to impute the missing value of F2, taking F2 as the response attribute and F3 and F4 as predictor attributes. After the value of F2 has been predicted and imputed, sample no. 100 has a value of F2 available for predicting other attributes. Now consider the second iteration at sample no. 100: the feature importance matrix directs the algorithm to impute the value of F1 first instead of F5. The values of F2, F3 and F4 are now available to predict the value of F1. Instead of fitting another model to predict the value of F1, the previously generated M1234 model is reused, and the values of F2, F3 and F4 are passed to that model to predict the value of F1. This reusability improves the performance of the proposed methodology and avoids the computation needed to fit another machine learning model. Comment Concerning feature importance: Is feature or variable importance quantified using correlations? If so, please clarify. Response The feature importance matrix is computed using the R package “randomForest”. Each feature in the dataset is taken in turn as the response, with the others as predictors, and the importance of each of the other features for predicting that response is computed. The features are then arranged by importance value from highest to lowest. A clarification of this step has been incorporated in the manuscript to address the reviewer’s concern. Comment Variables RN, TH, TC, RH and PR: Please define these abbreviations. Response RN, TH, TC, RH and PR stand for Radon, Thoron, Temperature, Relative Humidity and Pressure, respectively. We have also incorporated their full forms in the revised manuscript. Comment Fig 4: Overall, I really like this figure. It helps the reader to understand all the results qualitatively. However, I did not follow how the normalization was done quantitatively. Please clarify by either rephrasing or perhaps giving an example of normalization.
There is also one minor typo in the y-axis labels (“Mean: R 18. Fig 6: I did not understand the results in this figure since the concept of model reusability was not clear to me (see comment 14 above). Furthermore, the terms “model hit rate” and “model creation” have not been defined. Response Agreed. Regarding Figure 4, the average statistics are calculated for all variables across the different missingness scenarios (MCAR, MNAR and MAR). First, for every variable, the average of each performance metric over all imputation methods is calculated for each missingness scenario and its associated missingness percentage. Second, each value in a given missingness scenario, such as MCAR 10%, is normalized for each performance metric by dividing it by the corresponding average calculated in the first step. This gives the normalized value of each performance metric across the missingness scenarios and missingness percentages. Third, the normalized values are averaged over all variables for each imputation method. Finally, the average statistics of all the performance metrics across all variables are presented in Figure 4. We have corrected the typo in the y-axis label to address the reviewer’s concern. The detailed answer regarding model reusability is provided in the response above. The terms “model hit rate” and “model creation” are now defined in the manuscript. A model hit reflects the reuse of a previously fitted model. When imputing each value, before a model is created, the model formulation currently directed by the feature importance matrix is searched for in the model list. If it is found, the already fitted model is used to predict the missing value at this stage. If no model hit occurs, a new model is fitted for the current formulation and added to the model list, where it may be reused for later formulations. Comment Lines 364-368: Please rephrase to correct the grammar.
Response We have incorporated the necessary changes in the manuscript to address the reviewer’s concern. Comment Concerning “rejection threshold” or “rejection count”: How do the authors pick an appropriate value of the rejection threshold? Why did the authors pick a value of 3? If the number of attributes for a sample is 5, I would assume that a value of 4 may also work. Also, I would recommend keeping the terminology consistent to avoid ambiguity, i.e., use either the phrase “rejection threshold” or “rejection count”. Response Agreed. To avoid ambiguity, we now use the phrase “rejection threshold” instead of “rejection count” throughout the manuscript. The rejection threshold controls how many attribute values may be missing in a sample that is still imputed. As the number of missing values in a sample increases, there is a chance that imputing them contaminates that sample and leads to poor analyses or biased results in further experiments using the imputed dataset. A rejection threshold of 4 would also work; however, we have a total of 5 attributes, and the value of 3 was selected on the assumption that at least two attributes should be available to serve as predictor attributes in order to obtain more accurate results. Comment Concerning “Keywords”: I suggest the authors revise the keywords for the manuscript. The “Naïve Bayes” classifier and “Random Forests” are not used in this work and should not be used as keywords. Response Agreed. We have incorporated the necessary changes in the manuscript to address the reviewer’s concern. Reviewer #2: Comment Abstract: The authors need to revise it to highlight the problem, findings and novelty of their work. Response We have revisited all the concerns that were mentioned by the respected reviewer and have responded in detail in the revised manuscript.
We have provided a more detailed description of the scenarios in which the proposed methodology can be beneficial. Basically, this method is most useful in scenarios where more than one missing value occurs in a sample. To predict the missing values for such a sample, a single model based on all the other predictor variables is not enough, because any machine learning method needs all of its predictor values to be available in order to make a prediction. Using this method, one can impute the missing values in the whole dataset automatically with any base machine learning algorithm, regardless of how many missing values occur in different samples. The number of imputations in different samples is controlled by the user of this methodology. Comment Introduction: The literature review in this section needs to be updated with recent published works relevant to the study. The authors should improve the problem statement and highlight the objectives of the study clearly and mention the main contribution in the study. Response Agreed. This was also pointed out by the other respected reviewers. To address this concern, we have restructured the introduction and provided more related work on missing data imputation. We have also described the shortcomings of the other imputation methods used in this paper. Comment Material and Methods: More explanation about the importance of the proposed model to solve the current problem. Response We have revisited all the concerns that were mentioned by the respected reviewer and have responded in detail in the revised manuscript. We have described the parameters and provided detailed information on the proposed methodology as well as the other imputation methods used in this study. We have also taken a deeper look at model reusability and its effectiveness in the imputation process of the proposed methodology.
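The role of the rejection threshold discussed in the responses above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: a sample is taken to be rejected when it has more missing values than the threshold, so that with 5 attributes and a threshold of 3 at least two predictor attributes remain. The manuscript's exact rule may differ, and the function names and example values are hypothetical.

```python
ATTRIBUTES = ["RN", "TH", "TC", "RH", "PR"]  # abbreviations as defined in the response letter


def split_by_rejection_threshold(samples, threshold=3):
    """Sort samples into complete ("pure"), imputable, and rejected groups.

    Assumption: a sample is rejected when it has more than `threshold`
    missing values; complete samples need no imputation at all.
    """
    pure, imputable, rejected = [], [], []
    for sample in samples:
        n_missing = sum(sample[name] is None for name in ATTRIBUTES)
        if n_missing == 0:
            pure.append(sample)
        elif n_missing <= threshold:
            imputable.append(sample)
        else:
            rejected.append(sample)
    return pure, imputable, rejected


def available_predictors(sample):
    """Attributes with an observed value, usable as predictors for this sample."""
    return [name for name in ATTRIBUTES if sample[name] is not None]


# Made-up example values, for illustration only.
samples = [
    {"RN": 15000.0, "TH": 900.0, "TC": 21.0, "RH": 60.0, "PR": 950.0},
    {"RN": None, "TH": None, "TC": 22.5, "RH": None, "PR": 948.0},
    {"RN": None, "TH": None, "TC": None, "RH": None, "PR": 951.0},
]
pure, imputable, rejected = split_by_rejection_threshold(samples, threshold=3)
```

In this reading, the second sample has three missing values and keeps two predictors (temperature and pressure), while the third sample, with only one observed attribute, is left unimputed.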
Comment Results and discussion: The authors are encouraged to add more explanations to the findings and justify it clearly. Response We have revisited all the concerns that were mentioned by the respected reviewer and also responded in detailed form in the revised manuscript. Comment Conclusion: The authors are advised to re-write this section to justify the findings and suggest future work to be carried in terms of deploying the models or proposing way to enhance it. Response We have rewritten this section as per directions of the respected reviewer. We have also provided future work that can be carried out using proposed methodology. The scenarios where it can be beneficial for missing values imputation are also discussed. Comment References: The authors missed out recent references related to their works which they are encouraged to included in their revised version Response We have revisited all the concerns that were mentioned by the respected reviewer and also responded in detailed form in the revised manuscript. The related references are added to the revised manuscript. Reviewer #3: Comment The author mentioned in the abstract, introduction, and conclusion that they have used XGBoost, however; the XGBoost was not mentioned once in the methodology section. Yes, it is there in Figure 2 but it was not found in the text. Thus, it is not clear the role that XGBoost played in the proposed framework. Response We have revisited all the concerns that were mentioned by the respected reviewer and also responded in detailed form in the revised manuscript. We have provided the role of XGBoost for using it as base learning algorithm in proposed methodology. Apart from XGBoost, using this methodology by employing any base learning algorithm of our choice, one can impute the missing values in whole dataset automatically without taking care of how much missing values occurs in different samples. 
The number of imputations in different samples is controlled by the user of this methodology. Comment I don’t understand why the authors employed several statistical metrics which measure the same characteristics. For example, RMSE and MSE, and RMSLE are somehow the same. In a matter of fact, the author showed that RMSE and MSE have several disadvantages. The authors should use only one of them like RMSLE and remove the remaining. The others make the paper longer for no added information. The same goes for MAPE and PB; I encourage the authors to pick only one of them. This will reduce the paper length and will give more focus to the new framework. Response Multiple metrics are used for performance evaluation because missing values are imputed for all the variables (radon, thoron, temperature, relative humidity and pressure) in the soil radon gas concentration time series dataset, and these variables have different scales. The radon concentration time series ranges from a minimum of 13743 to a maximum of 28085 Bq/m³, while the temperature time series ranges from 4 to 42.5 °C. The resulting prediction error for radon and thoron is therefore much larger than that for temperature and relative humidity. Because RMSLE works on a logarithmic scale, it does not heavily penalize large absolute errors and is usually used when large errors should not dominate the result; it gives relatively more weight to errors on small values. When the actual and predicted values are low, RMSE and RMSLE are usually similar. MAPE is used for attributes such as radon and thoron because it is less biased towards higher values. Finally, PB is used to account for overestimation or underestimation bias, and so provides insight beyond RMSE, RMSLE, MSE and MAPE. Thus, instead of relying on a single performance evaluation metric, multiple evaluation metrics or loss functions are used to accommodate all the variables in the dataset.
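The scale argument above can be illustrated with a short Python sketch. The formulas are the commonly used definitions of these metrics and are assumed here rather than taken from the manuscript (in particular, PB is written as 100 times the summed error divided by the summed actual values). The example values are made up, chosen only to lie within the radon and temperature ranges quoted in the response.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the units of the variable."""
    return math.sqrt(mse(actual, predicted))


def rmsle(actual, predicted):
    """Root mean squared log error, computed on log(1 + x)."""
    total = sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def percent_bias(actual, predicted):
    """Positive values indicate overestimation, negative values underestimation."""
    return 100.0 * sum(p - a for a, p in zip(actual, predicted)) / sum(actual)


# Both variables are overestimated by the same 5 percent (illustrative numbers).
radon_true = [14000.0, 20000.0, 28000.0]
temp_true = [4.0, 20.0, 42.5]
radon_pred = [1.05 * v for v in radon_true]
temp_pred = [1.05 * v for v in temp_true]

scores = {
    "radon": (rmse(radon_true, radon_pred), mape(radon_true, radon_pred)),
    "temperature": (rmse(temp_true, temp_pred), mape(temp_true, temp_pred)),
}
```

With the same relative error on both variables, RMSE differs by orders of magnitude because it carries the units of each variable, while MAPE and PB are equal for both, which is why scale-free metrics are reported alongside RMSE and MSE.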
Moreover, the resulting values of all the performance evaluation metrics are provided as supplementary material with the manuscript. The normalized values of these performance evaluation metrics are presented as figures in the manuscript. Comment Also, I feel it is not fair to compare the new framework to conventional data filling techniques. Obviously, the new framework will be better. I encourage the authors to add a random forest or support vector machine model to the comparison which may increase the work strength. Response Agreed with the respected reviewer. We have compared the new framework with the conventional data filling techniques. The reason for using the conventional techniques is that we introduced missing values in all five variables. This leads to different patterns of missingness, in which anywhere from one to all of the variables may be missing in a sample. To predict the missing values for samples with more than one missing value, a single raw model (such as a random forest or a support vector machine) based on all the other predictor variables is not enough, because any machine learning method needs all of its predictor values to be available in order to make a prediction. Thus, there is a need for a robust iterative methodology that can envelop these basic machine learning models so that they adapt to the missingness scenario in each sample. This methodology enables a basic machine learning method to impute more than one missing value in different samples automatically by using the feature importance matrix, regardless of how many missing values occur in different samples. The number of imputations in different samples is controlled by the user of this methodology. Comment A brief description of the data should be given.
Yes, it may be described elsewhere. I think a very brief statistical description of the data itself is also required here. The introduction is badly written. It doesn’t follow a line of ideas. Many ideas are repeated here and there in the introduction section. Please, review it once more. Response We have revisited all the concerns that were mentioned by the respected reviewer and have responded in detail in the revised manuscript. We have provided the statistical details of the data in a table in the revised manuscript. The introduction has been restructured in the revised manuscript to maintain the flow of ideas, and recent work on the imputation of missing data has also been included. Kind Regards Prof. Dr. M. Rafique Submitted filename: Response to Reviewers.docx
3 Nov 2021 PONE-D-21-18941R1 Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data PLOS ONE Dear Dr. Rafique, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Dec 18 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.
Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Shamsuddin Shahid Academic Editor PLOS ONE Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.
Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation. Reviewer #1: (No Response) ********** 2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes ********** 3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes ********** 4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No ********** 5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. 
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: No ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors have presented a revised version of their manuscript. Overall, I do not think the manuscript is ready for acceptance yet. While I do not question the methodology and the results, the presentation and text need to be thoroughly proofread for clarity and grammar before the manuscript can be considered to be of publication quality. Although the authors have sought to address all my comments, the responses lack clarity and, at times, were accompanied by poor grammar in the manuscript. I had to re-read the responses multiple times to understand them and had to eventually guess an explanation that made the most sense. Care needs to be taken to ensure that the language used in the manuscript is precise. Given that PLOS ONE does not copyedit accepted manuscripts, this is very important. 1. Please define SRGC in the introduction when it is mentioned for the first time. 2. Response 2, lines 93-94: Please grammar-check 3. Response 4: please grammar-check the description of various imputation methods that have been added to the introduction 4. Response 6: description of missingness mechanisms lacks clarity; please proofread 5. Response 8: there is reference to PMM in the description of hot-deck imputation. This adds confusion since PMM is described in the next section. Further, the comment asked the authors to rephrase the description to remove jargon. 
However, the description is now too detailed which ultimately still falls short of explaining the method adequately (e.g, how is the donor pool picked?). An effort needs to be made to keep the description short by providing only the relevant information. 6. Response 9: this comment has not been addressed adequately. The mechanics of PMM have not been explained, nor have the parameters been described. 7. Response 10: The pseudo-code is still difficult to follow. 8. Response 14: The description of model reusability needs to be improved for readability, and then inserted in the manuscript. Also, while I agree that reusability reduces computation time, I am not sure how it results in more accurate predictions. Finally, I assume that “pure data” PD is used to create models. If so, that should be clarified. Presently, the purpose of PD has not been explained explicitly anywhere in the text. 9. Response 17: In Figure 7 of the revision, what is the measurement number? You mentioned "scans" and "iterations" in response 14. How does it relate to that? 10. Response 19: Please state your assumption justifying the use of a rejection threshold of 3 in the manuscript. 11. It seems that PLOS data policy requires data underlying the findings to be fully available, which includes the data points behind the summary statistics. I only see the summary statistics in the supplement. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. 
Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 20 Nov 2021 Dear Editor and Editorial Staff, PLOS ONE We are pleased to inform you that we have revised the manuscript in light of reviewers’ comments. The anonymous reviewers’ recommendations were extremely useful and we have addressed all of their recommendations in the revised manuscript. Please see below for responses to each comment. Reviewer #1: Comment Please define SRGC in the introduction when it is mentioned for the first time. Response We have incorporated necessary changes in the introduction to address reviewers concern. Comment Response 2, lines 93-94: Please grammar-check Response We have proofread as well as grammatically checked the lines 93-94 in the revised manuscript. Comment Response 4: please grammar-check the description of various imputation methods that have been added to the introduction Response We have proofread as well as grammatically checked the description of various imputation methods in the revised manuscript. Comment Response 6: description of missingness mechanisms lacks clarity; please proofread Response To address the reviewer's concern, we have restructured and clarified the concept in a more precise and easy way. 

Comment: Response 8: There is reference to PMM in the description of hot-deck imputation. This adds confusion since PMM is described in the next section. Further, the comment asked the authors to rephrase the description to remove jargon. However, the description is now too detailed, yet it still falls short of explaining the method adequately (e.g., how is the donor pool picked?). An effort needs to be made to keep the description short by providing only the relevant information.

Response: We agree with the findings of the respected reviewer. The reference to PMM in the hot-deck imputation section adds confusion when reading the description of the PMM imputation method. To address the concern, we have added the relevant details of each imputation method, and have rephrased and proofread the descriptions.

Comment: Response 9: This comment has not been addressed adequately. The mechanics of PMM have not been explained, nor have the parameters been described.

Response: To address this concern, we have explained the mechanics of PMM and its parameters more precisely and simply so that the concept can be easily understood.

Comment: Response 10: The pseudo-code is still difficult to follow.

Response: To address the reviewer's concern, we have added comments and restructured some statements in the pseudo-code so that it can be easily followed.

Comment: Response 14: The description of model reusability needs to be improved for readability, and then inserted in the manuscript. Also, while I agree that reusability reduces computation time, I am not sure how it results in more accurate predictions. Finally, I assume that “pure data” (PD) is used to create models. If so, that should be clarified. Presently, the purpose of PD has not been explained explicitly anywhere in the text.

Response: The proposed methodology imputes the attributes or variables using a feature importance matrix. The sequence in which missing features are imputed differs from one sample to another. The whole procedure scans the impure dataset subject to the rejection threshold.

Consider data row 1, in which features F1 and F5 have missing values, and data row 100, in which features F1, F2, and F5 have missing values. Data row 1 has the values of F2, F3, and F4 available to predict F1 and F5, whilst data row 100 has only the values of F3 and F4 available to predict F1, F2, and F5.

In the first iteration, when the scan is at data row 1, the feature importance matrix directs the algorithm to predict the missing value of F1 using a model fitted with F1 as the response and F2, F3, and F4 as predictor features. After the missing value of F1 has been predicted and imputed, the model is stored as M1234 for future patterns. Later in the same scan, at data row 100, the feature importance matrix directs the algorithm to impute the missing value of F2 by taking F2 as the response attribute and F3 and F4 as predictor attributes. After this imputation, data row 100 has a value of F2 available to predict other features.

Now consider the second iteration, when the scan reaches data row 100. The feature importance matrix directs the algorithm to impute the value of F1 first instead of F5. At this point, the values of F2, F3, and F4 are available to predict the missing value of F1. Instead of fitting another model, the previously generated M1234 model is reused, and the values of F2, F3, and F4 are passed to it to predict the missing value of F1.

Reusing an already fitted model to impute similar missing patterns enhances performance and reduces the time and space requirements. The accuracy of predicting the missing value for different features depends on the feature importance matrix and on the runtime decision of choosing the best available features for training, while model reusability reduces the computation time.

We agree with the respected reviewer. The pure data (PD) has all the feature values available to fit a machine learning model. To address this concern, we have explicitly explained the usage of PD in imputation by feature importance (IBFI).

Comment: Response 17: In Figure 7 of the revision, what is the measurement number? You mentioned "scans" and "iterations" in response 14. How does it relate to that?

Response: In IBFI, a “scan” traverses the whole impure data, checking each data row for missing patterns and selecting the best available features for training. An “iteration” consists of one full scan of the impure data. During each iteration, the number of missing features per data row is reduced by one. In Figure 7, the measurement number represents the data rows handled for imputing missing patterns by IBFI during the subsequent scans and iterations.

Comment: Response 19: Please state your assumption justifying the use of a rejection threshold of 3 in the manuscript.

Response: As the number of missing values in a data row increases, there is a chance that imputing them contaminates that row and leads to poor analyses or biased results in further experimentation with the imputed dataset. The dataset used in this study has 5 features. The assumption behind the use of a rejection threshold of 3 is to ensure that at least 2 features are available for model training. Imputing more than 3 out of 5 features may result in relatively less accurate imputation, because previously imputed values are used to predict the other missing feature values.

Comment: It seems that PLOS data policy requires data underlying the findings to be fully available, which includes the data points behind the summary statistics. I only see the summary statistics in the supplement.

Response: The data that support the findings of this study are available from the corresponding author upon reasonable request.

Sincerely,
Prof. Dr. M. Rafique

Submitted filename: Response to Reviewers.docx

19 Dec 2021

Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data
PONE-D-21-18941R2

Dear Dr. Rafique,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

Kind regards,
Shamsuddin Shahid
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g., participant privacy or use of data from a third party), those must be specified.

Reviewer #1: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics.

Reviewer #1: All technical issues have been addressed. Concerning the use of rejection threshold of 3, I suggest that the authors also state their justification explicitly in the manuscript. I also suggest that the authors explicitly state in the manuscript what they mean by "scans" and "iterations". This information was provided in the response document but I could not find it explicitly stated in the manuscript. The other comment is about the lack of data availability. The authors have only provided the summary statistics and not the actual data points. I leave it up to the editor to adjudicate whether this satisfies the PLOS Data Policy.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?

Reviewer #1: No

23 Dec 2021

PONE-D-21-18941R2
Imputation by feature importance (IBFI): A methodology to envelop machine learning method for imputing missing patterns in time series data

Dear Dr. Rafique:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Shamsuddin Shahid
Academic Editor
PLOS ONE
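
The scan, iteration, model-reuse, and rejection-threshold mechanics described in the authors' responses above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fit_linear` and `ibfi_sketch` are hypothetical, ordinary least squares stands in for the XGBoost base learner, the absolute Pearson correlation on the pure data stands in for the feature importance matrix, and the rule for choosing which missing feature to impute first is an assumed one. Missing values are marked with NaN.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in base learner (NOT XGBoost): least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    # rcond truncates near-zero singular values so collinear predictors are tolerated.
    coef, *_ = np.linalg.lstsq(A, y, rcond=1e-10)
    return coef

def ibfi_sketch(data, rejection_threshold=3):
    """Impute NaNs one feature per row per iteration, reusing fitted models."""
    out = np.array(data, dtype=float)            # work on a copy
    n_missing = np.isnan(out).sum(axis=1)
    pure = out[n_missing == 0]                   # PD: rows with every feature observed
    # Assumed stand-in for the feature importance matrix: |correlation| on PD.
    importance = np.abs(np.corrcoef(pure, rowvar=False))
    # Rows with more missing features than the threshold are rejected (left as-is).
    eligible = np.flatnonzero((n_missing > 0) & (n_missing <= rejection_threshold))
    models = {}                                  # (target, predictors) -> coefficients
    stats = {"fitted": 0, "reused": 0}
    for _ in range(rejection_threshold):         # one iteration = one full scan
        for i in eligible:
            miss = np.flatnonzero(np.isnan(out[i]))
            if miss.size == 0:
                continue                         # row already complete
            avail = np.flatnonzero(~np.isnan(out[i]))
            # Assumed rule: impute the missing feature most strongly related
            # to the features currently available in this row.
            scores = importance[np.ix_(miss, avail)].sum(axis=1)
            target = int(miss[np.argmax(scores)])
            key = (target, tuple(int(a) for a in avail))
            if key in models:
                stats["reused"] += 1             # same pattern seen before: reuse
            else:
                models[key] = fit_linear(pure[:, avail], pure[:, target])
                stats["fitted"] += 1
            coef = models[key]
            out[i, target] = coef[0] + out[i, avail] @ coef[1:]
    return out, models, stats
```

Calling `ibfi_sketch(data, rejection_threshold=3)` on a 5-feature array returns the imputed copy, the dictionary of stored models keyed by response and predictor indices (the analogue of M1234 in the response), and counts of fitted versus reused models. Each pass over the eligible rows fills at most one feature per row, so a row with three missing features is complete after three iterations, and a row with four missing features is never touched.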

Table 1

Statistical details of the SRGC time series dataset.

Variable	Mean	StDev	Minimum	Q1	Median	Q3	Maximum	Skewness
Radon	21364	2130	13743	19950	21569	22876	28085	-0.31
Thoron	2515.3	384.3	1495.0	2246.3	2489	2761	16182	4.26
Temperature	22.485	8.085	4	16	23	28.5	42.5	-0.05
Relative Humidity	77.884	13.166	34	70	81	88	101	-0.81
Pressure	928.26	4.92	914	925	929	932	943	-0.28
