Amir Aieb1,2, Khodir Madani1, Marco Scarpa3, Brunella Bonaccorso3, Khalef Lefsih1. 1. Laboratoire de Biomathématiques, Biophysique, Biochimie, et Scientométrie (L3BS), Université de Bejaia, 06000 Bejaia, Algérie. 2. Department of Computer Science, Faculty of Exact Science, Abderrahmane Mira University, Bejaïa 06000, Algeria. 3. Department of Engineering, University of Messina, Italy.
Abstract
Missing data is a very frequent problem in climatology, it influences on the quality of results that will afford in hydrological studies, as well as water resources management. This paper proposes a new imputation algorithm, based on the optimization of some regression methods, which are hot deck, k-nearest-neighbors imputation, weighted k-nearest-neighbors imputation, multiple imputation, linear regression and simple average method. The choice of these methods was justified by qualitative and quantitative statistical tests analysis. However, the reliability of obtained results depends mainly on percentage of missing data, choice of neighboring stations and data missingness mechanism which should be missing at random. During the study it was found that the most of stations in Soummam watershed don't have a good correlation because the large loss in rainfall data or the geology of watershed which gives a relationship between station position and rainfall variability. For this case, principal component analysis is applied on a set of stations; it showed a positive impact of altitude, latitude and longitude on correlation index between selected stations. The graphical analysis of the normal law on RMSE values, which were obtained by applying the proposed technique in several random cases of missingness, that are 4%, 8%, 12% and 16% respectively, it confirmed the validity and the performance of this approach.
Missing data is a very frequent problem in climatology, it influences on the quality of results that will afford in hydrological studies, as well as water resources management. This paper proposes a new imputation algorithm, based on the optimization of some regression methods, which are hot deck, k-nearest-neighbors imputation, weighted k-nearest-neighbors imputation, multiple imputation, linear regression and simple average method. The choice of these methods was justified by qualitative and quantitative statistical tests analysis. However, the reliability of obtained results depends mainly on percentage of missing data, choice of neighboring stations and data missingness mechanism which should be missing at random. During the study it was found that the most of stations in Soummam watershed don't have a good correlation because the large loss in rainfall data or the geology of watershed which gives a relationship between station position and rainfall variability. For this case, principal component analysis is applied on a set of stations; it showed a positive impact of altitude, latitude and longitude on correlation index between selected stations. The graphical analysis of the normal law on RMSE values, which were obtained by applying the proposed technique in several random cases of missingness, that are 4%, 8%, 12% and 16% respectively, it confirmed the validity and the performance of this approach.
The precipitation is one of variables commonly used to study climate variability, flow estimation and even to understand floods and landslides (Hong et al., 2007; Medlin et al., 2007). These studies require complete and reliable records of rainfall data. For this case, data was also represented by satellite maps, deduced by Algorithms to overcome spatial coverage limitations of rainfall and to optimize rainfall measurement (Adler et al., 2000; Pettazzi and Salsón, 2012; Vila et al., 2009). The availability of climatic terrestrial networks is one of factors to obtain the best estimation of precipitation for remote sensing communities (Sharifi et al., 2016), unfortunately in practice the obtained databases contain gaps due to systematic errors or other source of malfunctioning such as the absence of the observer, the destruction of gauging devices, the power failure and the elimination of incorrect data (Sharifi et al., 2016).The breaks in data acquisition of recording systems are more prevalent in Mediterranean countries (Gyau-Boakye and Schultz, 1994), it is an interesting topic for meteorologists, hydrologists and environmental managers to fill these gaps (Khosravi et al., 2015), however the problem is to find the real data while it has been judged that it is difficult to find a perfect method.In literature, it can find two possibilities for filling gaps. Firstly, to exclude matrix rows containing even a single MD (missing data elimination techniques), this technique deletes the cells of matrix that contain MD, it is widely used because of its simplicity; however, it does not give the most efficient utilization of data and it can incur a bias, just only if the values are not missing completely at random. Consequently, it can be used only in case where the gaps are very low (Song et al., 2008). Secondly, to estimate gaps of data series by replacing MD with values (imputed missing data techniques) (Li et al., 2007), the choice of these filling techniques refers to case where the strategy is the best reliable. There are many methods which can be applied to complete dataset; these procedures are designed to reduce number of gaps by replacing MD with nearest values, this process called completion or substitution. The most significant advantages of these procedures are the conservation of data dimension, therefore, the strength of statistical data processing. In a more or less broad measurement, all procedures of replacement are not important to be used if there is not random distribution of missing values. However, the replacement of MD is reliable when correlations between variables are very strong. On the other hand, it was shown that there is no ideal technique of estimation could exist (Presti et al., 2010). The effectiveness of each technique depends on a number of factors, such as the percentage of MD, the mechanism of data loss and characteristics of the variable under consideration.In this regard, there are two types of imputations to be distinguished: simple imputation and multiple imputation. Simple imputation is the one in which the missing value is replaced by the average of all measured observations in the same series, knowing that the amount of MD must be less than 5%; we can also use simple regression methods which starts by studying correlation between observed data of neighboring stations and that of reference stations, this technique is applied when data are missing randomly and the amount of the lack must be between 5% and 10% (Johnson, 2003). In this context, there are two types of regression methods, logistic regression that is used to treat discrete variables, and the linear regression applied for treating continuous variables. Multiple imputation is a treatment that uses several similar databases. It will give many replacements for the same unobserved value. Finally, the subsequent inference obtained by combining all imputed values (Sovilj et al., 2016). The most of these techniques suffer from problem of obtained results reliability when the data missed randomly, the case that let statistical studies based on the other hand to evaluate models which ignored the existence of MD, for example: Principal Component Analysis (PCA) model building (Folch-Fortuny et al., 2015; Nelson et al., 1996) and Wavelet method which is used to study climate variability on discontinuous temporal series that contains missing observations (Turki et al., 2016).This paper introduces new approach that uses hybrid methods to solve problem of reliability about obtained results of the filling, it was applied in missing climate data series. The proposed technique based on optimization of some imputation methods detailed in the content of this article.
Materials & methods
Usually, the choice of each filling method will be done according to missingness mechanisms (different ways in which data are missing). So to manage matrices characterized by incomplete data series properly, the first requirement is to identify the mechanism responsible for data losing (Little and Rubin, 1987). There are three different types of processes to be considered (Little and Rubin, 1987), the simplest case is when data loss occurs absolutely at random, which are indicated with the acronym MCAR (missing completely at random). It occurs when the probability of missing value depends neither on the variable itself nor of another variable of the database; in mathematical term, it is written:where X is observed data, X is MD and (r) is distribution condition of MD.On the contrary, when the probability that a value is missing depends on the value of other variables, and only on them, the condition is called MAR (missing at random) (Schafer, 1997). That is:Finally, the third and the last case arises when the loss of data does not occur randomly at all (NMAR), in this case, the probability that a value is missing depends on the missing value itself. That is:In general, whether the missingness mechanism is related to study variables or not, it is very significant as to determine how it is difficult to handle MD at the same time (Little and Rubin, 1987).
Methods
There are many imputation methods in the literature based on different approaches, used in specific domains or even for specific datasets transportation (Tang et al., 2015), meteorology (Junger and De Leon, 2015), and others (Folguera et al., 2015). Although, the proposed methodology can be considered when the MAR hypothesis is assumed (Gómez-Carracedo et al., 2014), it should be borne in mind that it can be applied for MCAR case, which are more random than MAR. The first step is to pre-process the time series, it means to identify gaps distribution, then to select a set of neighboring data series (e.g. neighboring weather stations) to those affected by missing values (target station), respecting the temporal behavior of variables under consideration. In climatology, we can use data recoding from weather station in the same watershed as neighboring time series if there is similarity between data using Spearman coefficient and mean absolute error. The following methods used to fill MD in this paper presented in different descriptions.
k-nearest-neighbors imputation (KNNI)
This method is based on the k observed values of the most similar time series, then all values are used into a single estimate using approaches such as the average methods or kernel function (Amiri and Jensen, 2016). If we can find single nearest neighbor data series (k = 1), in this case it called hot deck (Batista and Monard, 2003). These methods must be appropriate when data are MAR, meaning that the missing value is correlated with other observed variables.
Hot-deck
This method performs the estimation of MD of incomplete records P using values of similar complete records, belonging to the same data set when there is a large similarity between data, that means (k = 1) (Rahman et al., 2015). In our case study, the nearest station that has the greatest correlation between its values and that of the reference station which allows us to take the same value of this similar station in the same time.
Arithmetic average method
When the number of similar data series is equal two or more, a simple average method can estimate MD. In our case study, when rainfall values of each similar station have difference less than 10%, comparing to other measurement of record stations, we can apply KNNI shown in Eq. (5). But, when this difference is very large, a normal-ratio method was recommended. We can point out that the selection of neighboring precipitation stations for estimating MD must be based on meteorological judgment (Teegavarapu and Chandramouli, 2005).where P represents daily rainfall data of similar stations, k is the number of similar stations.
Weighted-nearest-neighbors imputation (WKNNI)
It is another estimation method based on the weighting coefficient of similar time series. In climate data series we used Euclidean distance between similarity stations and reference station to calculate this coefficient, in order to obtain the best estimation of MD then, the final result is determined by a weighted average of all neighboring data (Teegavarapu and Chandramouli, 2005; Troyanskaya et al., 2001). It is given by the following equation:where P is the MD observed in reference station x, k is the number of similar stations; P is imputed data, that means the observed rainfall data on neighboring station, is the weighting coefficient that equal, is the Euclidean distance between location of neighboring station i and the reference station x; and K is referred to as friction distance (Vieux, 2001), which ranges from 1.0 to 6.0.
Multiple imputation (MI)
Multiple Imputation is a filling method that provides valid statistical inferences under MAR condition. This method processes MD sets by using the standard procedures of regression, then to make combination between imputing results from these analyses for obtaining final result, as shown in Eq. (7) (Rubin, 1987; Schafer, 1997; Troyanskaya et al., 2001). The multiple imputation has been implemented in software such as SAS (Jerez et al., 2010) and Amelia II package (Honaker et al., 2011). To apply this method, it must follow the following steps: (i) Find k similar databases for each missing value; then the observed values will be used to impute MD. (ii) For each MD (P), we use data imputations P by applying regression methods to obtain k different estimate results I
(P,P). (iii) The final result P will be obtained by combining all imputations results I using average of all the k complete data values, that is:
Linear regression (LR)
The linear regression is a very fundamental computational procedure which forms the basis of many elaborate algorithms, like the alternating least squares algorithms (ALS) (Wang et al., 2003), it also assumes that values are missing at random. This method requires two steps, initially to estimate the relationship between predictors and missing values, then to use a trend equation for filling the gaps (Bárdossy and Pegram, 2014), and we can express it by Eq. (8):The value of a and b can be estimated by using respectively Eqs. (9) and (10):Where and are mean values of the data series, respectively, the reference and similarity stations (Bárdossy and Pegram, 2014; Khosravi et al., 2015).
Simple average method
According to (Johnson, 2003), for an amount of MD less than 5%, we can use any filling method; in our case study, this percentage represent just single missing value in one month. He reports that replacing MD with total average of all observations of the same month gives a good result, only if there is low correlation between variables (Presti et al., 2010), we can write that by:Where P is the observed rainfall data, n is the number of days for each month which can be (28, 29, 30 or 31) days.
Study area
This study focuses on the filling MD of daily precipitation measurements at Bejaia airport weather station of Algeria (Fig. 1), over 32 years. This station is located in the Soummam watershed with an area surface of 9200 km2, which is a part of eastern Algeria. It is bounded on the north by the mountains of Djurdjura (Lala Khadija with an altitude equal 2308 m), on the east by the plateau of Setif and Babors mountain chains (with an altitude equal 2004 m), as well as on the West by Bouira. Soummam consists of 10 sub-watersheds, equipped with 34 meteorological stations. This area offers a diverse climate and different morphological zones.
Fig. 1
Geographical map of Soummam watershed borders, followed by its position on north of algerian (medallion map).
Geographical map of Soummam watershed borders, followed by its position on north of algerian (medallion map).In this paper, the proposed filling methodology was applied to daily precipitation time series from January 1st, 1982 to December 31th, 2014, which were obtained from National Agency of Hydraulic Resources A.N.R.H and National Centers for Environmental Information (NOAA), https://www.ncdc.noaa.gov/cdo-web/. The dataset was affected by 3.47% of MD during the entire time of the study. The amount missing was detailed for each month in Fig. 2.
Fig. 2
Seasonal Percentage of missing data distribution per month over 32 year observation at the Bejaia Airport station and the related Standard entropy. *: The percentage of missing data for each month.
Seasonal Percentage of missing data distribution per month over 32 year observation at the Bejaia Airport station and the related Standard entropy. *: The percentage of missing data for each month.
Testing of data distribution
As already indicated above the ultimate objective of this paper is to apply different imputation methods for filling gaps in rainfall time series, in this case it can only be achieved if the mechanism producing data loss is not NMAR (Presti et al., 2010). To exclude this hypothesis, it is possible to test from empirical knowledge that the rainfall has a random distribution (MAR). To this aim, it was assumed that the loss of data was due to failure assessment, which is caused in particular by intense rainy events. If this hypothesis is correct, the following statements should be verifiable: The amount of MD should be affected by obvious seasonal behavior, which usually results in autumn and winter. A positive correlation between the amount of MD and the elevation of any station should exist; this means that rainfall tends to increase as function of altitude.In this context, it easy to verify the randomness of MD distribution by using standardized entropy, showed in Eq. (12) (Shannon, 1948).where m represents month, k is the number of months in one year which equal 12 and p(m) is the percentage of MD in each month.In our case, the standardized entropy coefficient was applied on inter-monthly rainfall data in all 32 years of study, to verify MD distribution during different seasons.Fig. 2 shows that no relationship exists between the amount of MD and the seasonality data distribution, showing that the greatest percentage of MD measured in summer is 10.2% but in winter it is only 7.3 %. On the other hand, the entropy value is evidently equal 0.98, it is close to 1, allowing us to reject the first hypothesis, therefore rejecting NMAR mechanism. To exclude any dependence of MD on rainfall measurements, MAR and MCAR mechanisms should be tested.An important consideration should be made to choose between these last assumptions, having to accept the influence of other variables on rainfall measurement, it can be reasonably affirmed that there are some influence also on instrumental failures that caused data losses. Accepting this last statement is equivalent to rejecting the MCAR mechanism. For this purpose (Rubin, 1976; Scheffer, 2002), point out that “MD is very rarely MCAR and they proved that daily rainfall distribution law is not Gaussian but gamma.
Station similarity
In order to determine the similarities between meteorological stations used to solve problem of MD observed in Bejaia-Airport weather station, the Spearman coefficients and residual average were compared, by coupling the time series values of the target station with the nearby stations values. To determine the similarity in index values, the whole statistical series was considered by including the “dry” days; characterized by no rainfall. This choice was proposed by respecting the following considerations: MD can also include no rainy days, as it was hypothesized that they are missed at random. To use only rainy days could bias the similarity index values as it would not allow for consideration of those events where a zero value was registered in the target station, whilst a rainfall event occurred in a neighboring station (Presti et al., 2010).Data represented relating to set of stations by PCA graph (Fig. 3) for three years that were selected randomly, this figure gives information about criteria to choose similar climatological stations to that of Bejaia-Airport, in order to fill MD observed in this station.
Fig. 3
Principal Component Analysis (PCA) score plot show the similarity between nearby stations setof Bejaia, airport station.
Principal Component Analysis (PCA) score plot show the similarity between nearby stations setof Bejaia, airport station.Fig. 3A shows the PCA graph, as a function of two main components F1 and F2, which are respectively coefficient of correlation (r) and mean absolute error (MAE). They are represented from low to strong, reading respectively; from left to right and from bottom to top. This figure shows three clusters, the choice is mainly based on the degree of correlation, knowing that F1 gives 96% of information rate.The PCA graph (Fig. 3A) shows that Tichy and Taza are the two nearest stations to Bejaia-Airport, representing the first cluster, they are very similar to each other according to the information obtained by Dendrogram (Fig. 4). However, it is preferable to choose Tichi owing to the availability of data. It provides a strong correlation index (0.98) (Table 1).
Fig. 4
Dendrogram plot show different clusters of nearest stations of Bejaia, airport weather station.
Table 1
Statistical parameters relating to weathers stations selected for filling missing data of Bejaia-Airport station.
Station
Altitude
Latitude
Longitude
Min
Max
X¯
σ
r
ER∗
R2
Tichy
1
36.67
5.16
0
33.0
4.37
7.71
0.98
0.93X1
0.97
Ziama
1
36.67
5.48
0
33.2
4.41
7.74
0.96
0.97X2+0.44
0.94
Azeffoun
1
36.79
4.42
0
33.0
4.49
7.74
0.88
0.74X3+0.75
0.77
Annaba
1
36.83
7.76
0
26.9
4.03
6.87
0.74
0.40X4+1.78
0.55
: Equation of Regression.: Average, σ: Standard deviation, R2 coefficient of determination.
Dendrogram plot show different clusters of nearest stations of Bejaia, airport weather station.Statistical parameters relating to weathers stations selected for filling missing data of Bejaia-Airport station.: Equation of Regression.: Average, σ: Standard deviation, R2 coefficient of determination.Ziama, Tigzirt and Azeffoun stations belong to the same cluster, however Ziama and Tigzirt are more similar than Azeffoun (Fig. 4), and the latter also shows a strong correlation (Table 1).Annaba and Boumerdes have the same degree of similarity, the both of them belong to the second cluster, shown in Fig. 3A; on the other hand Dar Sghir is a station of the third cluster which gives a lower similarity compared to all stations (Fig. 4).Fig. 3. B represents the PCA graph as a function of two principal components (F1 and F2), which are respectively Euclidean distance between similar stations and Bejaia-Airport station and Altitude. The first component gives an information rate equal to 74%, which represents more information, giving that the Euclidean distance for each station is obtained by using latitude and longitude. This graph shows the same clusters that were obtained in Fig. 3A.A relationship between the geographical positions of stations and the similarity of stations can be noticed. Such as, all stations of the first cluster have small Euclidean distances compared to others stations and altitude measurements between (0.9 and 1.1) which are near to Bejaia-Airport station geographical position that have an altitude equals 1m (Table 2). On the other hand, compared to the information obtained from Fig. 3A, we can deduce that when Euclidean distance decreases, the correlation index increases, as well as, when the altitude of the station increases or decreases the MAE error also increased.
Table 2
Daily rainfall datasets obtained in February 2005, used for Example 1.
Daily rainfall datasets obtained in February 2005, used for Example 1.MD (Missing Data). x1: Bejaia airport station. n1: Tichy statio,n2: Ziama station,n3: Azeffoun station,n4: Annaba station.We can confirm this information after interpreting the results of the second and third clusters. Boumerdes and Annaba are two stations belonging to different watersheds, that have the same Euclidean distance (Fig. 3B), and the correlation also gives the same information (Fig. 3A). However it is preferable to take Annaba station for filling gaps in data, because the MAE value of Boumerdes is larger than that of Annaba, we can also find the same meaning to this information using the altitude variable (Fig. 3B), knowing that altitude of Annaba is closer to that of Bejaia-Airport compared to Boumerdes.With these results we can also confirm the previous hypothesis that rainfall is missing at random; this it means that it has a deterministic role on precipitation series, which explain the relationship between geographic coordinates and correlation index.The dendrogram plot shown in Fig. 4 gives information regarding the best choice of the nearest station, according to the amount of MD recorded at each station. The graph shows that the most similar stations are respectively (Tichy, Ziama), (Azeffoun, Tigzirt) and (Annaba, Boumerdes). But in data processing, one station for each cluster can be taken, depending on the filling method used, in this case the following can be selected (Tichy, Ziama, Azeffoun and Annaba).
Algorithm description
The various methods used to fill the missing climate data series showed before were summarized in the flowchart of the new algorithm (Fig. 5). The proposed methodology was applied on space–time continuous data (i.e. rainfall time series referred to different monitoring stations belonging to the same network). The data was arranged in matrices with the following structure: each matrix represents one year of each climatological station, which has a row that it uniquely associated to the date of the observation and each column is uniquely associated to month.
Fig. 5
Flowchart summarizing the filling daily rain fall dataset. Pretreatment of climate data bases (Step1), filling missingdata with the appropriate method (Step2), checking the percentage of missing data (Step3).
Flowchart summarizing the filling daily rain fall dataset. Pretreatment of climate data bases (Step1), filling missingdata with the appropriate method (Step2), checking the percentage of missing data (Step3).The first step of this process began with pretreatment of the dataset and tested to find similarities between stations by using the geographical coordinates of each station (Fig. 2) then to calculate the percentage of MD. The chosen method shown in the second step for filling MD is determined as a function of similarity indices between stations, proximity of values representing toleration between imputation values of similarity stations, in addition to the availability of neighboring stations. In the third step, the percentage of MD after every step of the filling must be calculated, in order to check the rate of the gaps remaining after the second step, so that it can be filled either using a simple regression method (LR) or an arithmetic average (SAM). The processing is done iteratively until amount of MD is null.
Example
Table 2 shows daily precipitation measurements at Bejaia-airport station during February 2005. The effectiveness of this new approach is shown by using 25% of MD that is chosen randomly throughout these series. In this table, stations noted Pn1, Pn2, Pn3 and Pn4 are respectively, Tichy, Ziama, Azeffoun and Annaba which are used as neighboring stations to process the MD in this example. Table 1 shows statistic parameters of the neighboring stations and large correlation indexes can also be seen, between (0.74 and 0.98), meaning that the similarity between selected stations and the reference station is already existent.In this example MD are noted by '-' and 'K' means the number of neighboring stations; 'Δ' is the toleration between the rainfall data imputed values of all similar stations in the same time. The calculation steps are the following:Step1 (Find the K nearest stations):According to Table 1 and the result obtained in (Fig. 4), Tichy station can be used to fill gaps by hot deck method, the neighboring stations which are respectively, Ziama, Azeffoun and Annaba have been used for filling data by KNNI, WKNNI, LR and MI methods.Step2 (Filling gaps):In the 1
day we have P is MD values and K is the number of similar stations that equal 4, but in this case it is better to choose Tichy station (Table 1) for filling gaps P by using Hot-deck method:For the 3
day: P is missing and the number of similar stations K is 3, which are Ziama, Azeffoun and Annaba. So, according to Fig. 5 and Table 1, we can choose between KNNI or WKNNI methods, this implies calculating the difference between imputed values (Δ):According to these results, it is better to use only Ziama (Pn2) and Azeffoun (Pn3) stations for filling data by applying KNNI method:For 5
day: in this case, there are just two similar stations (Ziama and Azeffoun), so to calculate MD (P), we need to choose between WKNNI and MI methods.Value is more than 10% and Table 1 shows that the correlation index of rainfall values between “Ziama and Bejaia-Airport” and also between “Annaba and Bejaia-Airport” are 96% and 88%, respectively. So it is preferable to apply WKNNI method:For 11
day: we have two similar stations that are Ziama and Annaba.One can see in Table 1 that correlation index, between Annaba and Bejaia-Airport station is 74%. In this case, according to Fig. 5, we can fill the datum using MI method:To use this method, we must calculate the regression equations, shown on (Table 1) that will be applied in each imputation, which are:For 16
day: the number of similar stations K is two (Ziama and Annaba), they have correlation index of 96% and 74%, respectively (Table 1). In this case Δ equal 31%, so we have to use only WKNNI (Fig. 5).For 19
day: the percentage of MD obtained on Bejaia-Airport station series is 7.14%, so it can fill these gaps using LR method (Fig. 5):For 28
day: the percentage of MD in Bejaia-airport station series is 3.57%, it is less than 5% and it doesn't have at least one similar station, so in this case we can apply only SAM to fill the gaps (Fig. 5):
Experimental
In this section, experimental work was done to compare all methods used in the proposed technique of filling missing climate data. The validation criteria are applied on daily rainfall data for three years which were randomly selected (1997, 2003 and 2013). The calculations applied to fill lack of data in this part were obtained by the implementation of the proposed algorithm on the Delphi language platform (version XE2).We have chosen four cases where data was missing, which are respectively (4%, 8%, 12% and16%) for showing the reliability of these methods respecting to the same conditions proposed by (Presti et al., 2010). To validate filling data results, it is necessary to compare the obtained rainfall data with actual measurements of the same station using a coefficient of determination R2 showed in Eq. (13) and an adjusted coefficient of determination R2Adj showed in Eq. (14). The Error measurements must also be accepted to ensure their reliability, for this case we have used others parameters, such as root mean squared error RMSE, mean absolute error MAE, which are given by Eqs. (15) and (16), Knowing that the lower RMSE value gives the best imputation (Chai and Draxler, 2014).Where ‘n’ is the total number of observations, X are the observed values and X are the modeled values. N is the number of points in our data sample. K′ is the number of independent repressors, i.e. the number of variables in our model excluding the constant.
Results & discussion
In this comparison, different cases of missing values amount were applied to justify the best use of the filling techniques. The number of similar stations considered in this estimation is three and the nearest station, which are Ziama, Azeffoun, Annaba and Tichy respectively.Fig. 6 show Residual Analysis plots followed by linear regression graphs between imputation results and real rainfall measurements for each filling method when data missed 4%, 8%, 12% and 16%. These graphs show that in the case of 4%, all methods are acceptable and this result confirmed the proposition of Presti about the choice of methods (Presti et al., 2010), we can take into consideration the best choice according to Table.3 which shows that Hot-deck, KNNI, WKNNI and MI methods perform better than LR and SAM.
Fig. 6
Residual curves obtained from linear regression graphs (medallion graph) between predicted and experimental values of rainfall data when the amount of missing data equal 4%, 8%, 12% and 16%. (EV) Experimental values, (PV) predicted values, (*): Missing Data.
Table 3
Performance indicators for rainfall missing data when amount are equal to 4, 8, 12 and 16%. Determination coefficient (R), adjusted determination coefficient (R), root mean square error (RMSE), mean absolute error MAE.
Hot-Deck
KNNI
WKNNI
MI
LR
SAM
1st case: 4 % of MD∗
R2
0.999
0.930
0.853
0.924
0.725
0.562
RAdj2
0.999
0.930
0.852
0.924
0.723
0.56
RMSE
0.009
0.069
0.011
0.113
0.254
0.269
MAE
0.001
0.001
0.007
0.015
0.042
0.024
2ndcase: 8 % of MD∗
R2
0.995
0.918
0.826
0.847
0.712
0.341
RAdj2
0.995
0.918
0.826
0.847
0.711
0.338
RMSE
0.013
0.137
0.045
0.241
0.396
0.709
MAE
0.002
0.004
0.018
0.04
0.086
0.039
3rdcase: 12 % of MD∗
R2
0.999
0.994
0.956
0.817
0.679
0.261
RAdj2
0.999
0.994
0.956
0.816
0.678
0.258
RMSE
0.035
0.094
0.08
0.637
0.959
1.077
MAE
0.003
0.003
0.042
0.136
0.246
0.156
4thcase: 16 % of MD∗
R2
0.999
0.997
0.983
0.952
0.397
0.265
RAdj2
0.999
0.997
0.983
0.952
0.394
0.246
RMSE
0.128
0.218
0.548
0.978
3.891
4.327
MAE
0.004
0.002
0.086
0.197
1.029
1.083
MD (Missing Data).
Residual curves obtained from linear regression graphs (medallion graph) between predicted and experimental values of rainfall data when the amount of missing data equal 4%, 8%, 12% and 16%. (EV) Experimental values, (PV) predicted values, (*): Missing Data.Performance indicators for rainfall missing data when amount are equal to 4, 8, 12 and 16%. Determination coefficient (R), adjusted determination coefficient (R), root mean square error (RMSE), mean absolute error MAE.MD (Missing Data).When the global rate of MD is 8%, the residual values corresponding to SAM graph show that most of the values are negative, flowing with a high fit of data which is remarked on the regression plot (Fig. 6), this method is not usable in this case of study, as it noted a bad correlation (Table 3). The graph shown in Fig. 7 proves that Hot deck and KNNI methods perform the best. WKNNI and MI are similar, having a correlation index that equals 0.82 and 0.84 respectively, however the histogram graph represented by Fig. 8 shows that RMSE values obtained when using MI are higher compared to WINNI.
Fig. 7
Radar graph of the adjusted determination coefficientobtained from differentmissing rainfall data imputation methods with different amount of missing data at Bejaia airport station.
Fig. 8
Histogram of root mean square errorRMSE obtained from different missing rainfall data imputation methods with 4, 8, 12 and 16% of missing data at Bejaia airport station.
Radar graph of the adjusted determination coefficientobtained from differentmissing rainfall data imputation methods with different amount of missing data at Bejaia airport station.Histogram of root mean square errorRMSE obtained from different missing rainfall data imputation methods with 4, 8, 12 and 16% of missing data at Bejaia airport station.For a case where the missing data is 12%, the graph shown in Fig. 6 pointed out that the results obtained when LR method was used is less reliable in comparison with all previous cases.Table.3 shows that hot deck, KNNNI, WKNNI are the methods that perform the most having a strong correlation which are more than 95%, on the other hand the filling by using MI method is still feasible and giving regression index of 81%.In the last case, the results are given in Table 3. When the percentage of MD was 16%, suggesting that the hot-deck method always provides the best performance, based on both of RMSE and MAR parameters, it can also be found that KNNI and WKNNI provided good values of R2, which are respectively; 0.997 and 0.923.However, the MI method shows a very high performance, compared to WKNNI (Table.3), as most of the imputation cases to fill the data gap in this study use the data observed at the Annaba station. The latter has a degree of correlation of 74% (Table.1). The regression equations used under MI that were obtained by the correlation study between the observed data of each station are respectively:Where Y is daily rainfall data imputation in Bejaia-Airport station, x is daily rainfall data observed in Ziama station, x is daily rainfall data observed in Azeffoun station, x is daily rainfall data observed in Annaba station.A bad linear relationship and a great deviation of residual values which was observed between experimental and imputing values when LR and SAM was used (showing in Fig. 6), demonstrated that it should be rejected.After testing all cases of MD, the filling final results explained that the hot-deck method outperforms in all slices filling (Fig. 6). The validation tests relative to KNNI and WKNNI shown in Fig. 7 and Fig. 8, highlight that there is no significant difference in relationship with the respecting amount of missing values.In Table 4 we can see that in all cases where the distance is varying, the choice of K equal 1 gives the best filling results when WKNNI method is used. But when this distance is very large, the similarity indexes of neighboring stations are less than 75%, it can be seen that MI method performs better because the small variation on rainfall intensity can introduce significant changes in the runoff values generated from distributed rain-runoff models (Vieux, 2001). Considering the amounts of missing values, it is obvious that the results are quite similar to what is obtained for 5% of MD. When this percentage rises, the RMSE raises slightly, showing in Fig. 7. All the errors of obtained measurements and the good fit criterion point conclude that hot-deck, KNNI and WKNNI are better methods compared to regression methods such as MI and LR.
Table 4
Weighting coefficients and adjustment quality parameters obtained from WKINNI and MI methods taking into account the neighboring stations and their separating distances (the amount of missing data is equal to 16%).
Stations similarity (Xi)
ED(Km)
Method
K
R2
RAdj2
RMSE
MAE
Ziama
41
WKNNI
1
0.9989
0.9989
0.0276
0.0427
Azeffoun
58.4
2
0.9986
0.9986
0.0312
0.0441
3
0.9983
0.9983
0.0343
0.0454
4
0.998
0.9980
0.037
0.0465
5
0.9978
0.9977
0.0392
0.0474
6
0.9975
0.9974
0.0409
0.0481
MI
-
0.997
0.9969
0.0447
0.0774
Taza
42.5
WKNNI
1
0.8001
0.7955
0.3837
1.1604
Tigzirt
86.4
2
0.7772
0.7721
0.4305
1.2729
3
0.7599
0.7544
0.468
1.3750
4
0.7495
0.7437
0.4915
1.4378
5
0.7438
0.7379
0.5045
1.4784
6
0.7227
0.7163
0.435
1.3303
MI
-
0.8117
0.8074
0.3658
0.0862
Annaba
122.2
WKNNI
1
0.6976
0.6906
0.4641
1.1433
Dar sghir
98.5
2
0.6859
0.6787
0.4746
1.1664
3
0.6723
0.6648
0.4858
1.2037
4
0.6576
0.6497
0.4972
1.2387
5
0.6272
0.6184
0.5083
1.2707
6
0.6272
0.6184
0.5189
1.2995
MI
-
0.7435
0.7377
0.4139
0.8448
Euclidean distance between the neighboring station and Bejaia Airport station(ED), coefficient using to calculate weighting coefficient (K), determination coefficient, adjusted determination coefficient, root mean square error and mean absolute error.
Weighting coefficients and adjustment quality parameters obtained from WKINNI and MI methods taking into account the neighboring stations and their separating distances (the amount of missing data is equal to 16%).Euclidean distance between the neighboring station and Bejaia Airport station(ED), coefficient using to calculate weighting coefficient (K), determination coefficient, adjusted determination coefficient, root mean square error and mean absolute error.Fig. 9 represents a simulation of RMSE results obtained over the same period of study after to fill (4%, 8%, 12% and 16%) of MD by the proposed algorithm. In each case, RMSE was calculated 50 times randomly for all climatic seasons, in order to study density of error distribution relating to this technical applied on daily rainfall dataset. All graphs show that the distribution of this random variable follows normal law, knowing that in this case the representation of RMSE values were done in class.
Fig. 9
Simulation of RMSE density obtained by generating randomly 50 times of missing data of each season, using 4% (A), 8% (B), 12% (C) and 16% (D) of missing data.
Simulation of RMSE density obtained by generating randomly 50 times of missing data of each season, using 4% (A), 8% (B), 12% (C) and 16% (D) of missing data.When the missing data is 4%, the graph of Fig. 9A shows a symmetric distribution recorded in autumn, spring and summer, which for each case gives an average (U = 5) and median (med = 5), the variability changed from weak to strong, respectively, from the driest to the wettest season. On the other hand, the filling data in winter graphically shows an asymmetric distribution (right skewed distribution) that gives positive errors justified by a variability index, which equal 0.038.When this amount of MD increases to 8%, 12% or 16%, it is found that RMSE distribution is always symmetrical in summer (Fig. 9. B, C, D), only the variability index (Ϭi) increases relatively to the amount of MD; this means that the filling data in this season gives a large proportion of RMSE which always turns around the average. Fig. 9. B, C, and D show that the filling data in autumn and winter when the percentage of MD increases is marked by positive distribution, the variability increases as function of this amount. Whereas the density of RMSE recorded in spring is right skewed distribution only when the lack of data equals 12% and 16%.After all of these cases, it can be deduced that the filling of data using the proposed algorithm respects the seasonality of the same rainfall series. The density of errors is always positive, which increases relatively with the amount of MD and converges to small values.
Conclusion
Climate databases suffer from the problem of MD, however the best choice of each method is still on studying missingness mechanism and percentage of MD. The main contribution of this article is to propose a new hybrid technique of filling gaps for estimating missing climate databases more reliable, using climatological network data.The work shows two main results: (I) explaining the best choice between several of methods, according to the results of comparison using statistical tests on different cases of MD amount; the results shows that the Hot-Deck, KNNI and WKNNI methods give the best filling. It does not depend on the percentage of data. It can be seen that when Tichy, Jijel and Azeffoun stations were chosen for filling 4%, 8%, 12% and 16% of MD, the RMSEs values are always between 0.001 and 0.548. On the other hand, the choice of Annaba and Dar Sghir stations show that the MI method performs better than WKNNI, which have RMSEs of 0.4139 and 0.4641 respectively; these results depend on a degree of similarity between stations, which is less than 75% (II) It shows the influence of altitude, latitude and longitude on the correlation index and the residual average between neighboring stations and reference station. This variation affects the imputation results obtained by MI, LR and SAM. The proposed filling methodology can be also applied to estimate data sets of other case studies when the missing distribution is random.
Related work
We want to propose as a future work of this article, the implementation of our filling technique in form of new software that will serve not only climate data bases but we will try to include other methods to generalize the model of filling data. The algorithm will verify the reliability of the results and even the asymptotic complexity of the algorithm comparing to other software and scripts like (Amelia II script for R) and (MDI toolbox for Matlab) and also SPSS software.
Declarations
Author contribution statement
Amir Aieb: Performed the experiments; Analyzed and interpreted the data; Wrote the paper.Khodir Mandani: Conceived and designed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.Marco Scarpa; Brunella Bonaccorso: Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.Khalef Lefsih: Conceived and designed the experiments; Analyzed and interpreted the data; Wrote the paper.
Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Competing interest statement
The authors declare no conflict of interest.
Additional information
No additional information is available for this paper.