
The water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm.

Xin Liu¹,², Xuefeng Sang², Jiaxuan Chang², Yang Zheng², Yuping Han¹.

Abstract

Water supply association analysis plays an important role in attributing water supply fluctuation, so carrying out this analysis effectively has become a critical problem. However, current association analysis techniques are not very effective because they operate on continuous data. Continuous data generally exhibit monotone relationships of varying degree, which easily bias the analysis results. Multicollinearity between continuous data also distorts these methods and may generate incorrect results. Moreover, these methods yield neither the association rules nor the value intervals that link features to water supply. The lack of an effective analysis method therefore hinders water supply association analysis. Association rules and feature value intervals obtained from association analysis help explain the causes of water supply fluctuation and bound its fluctuation interval, and thus provide better support for water supply dispatching, yet they are not fully understood. In this study, a data mining method coupling kmeans clustering discretization and the apriori algorithm was proposed. Kmeans was used for data discretization to obtain the one-hot encoding that apriori can recognize, and the discretization also avoids the influence of monotone relationships and multicollinearity on the analysis results. All rules are then validated in order to filter out spurious rules. The results show that the proposed method is an effective association analysis method. It can not only obtain the valid strong association rules between features and water supply, but also reveal whether the association between features and water supply is direct or indirect. It can also obtain the value intervals of features, the association degree between features, and the confidence probability of rules.


Year: 2021 PMID: 34351977 PMCID: PMC8341608 DOI: 10.1371/journal.pone.0255684

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Water supply fluctuation is affected by a variety of factors, including floating population (FP), rainfall (R), industrial structure, human activities and so on. To understand the cause of the fluctuation, a scientific analysis method needs to be designed to carry out the association analysis between features and water supply (WS), so that the fluctuation interval of WS can be inferred from changes in the value intervals of these features. However, this is not a traditional problem of finding the optimal solution among multiple objectives [1], but a complex problem of association analysis, so methods that search for an optimal solution with an objective function and a heuristic algorithm cannot solve it [2, 3]. We need to know not only the association rules between features and WS, but also how the features are correlated with WS and what their sensitive intervals are. For example, an increase in FP leads to an increase in domestic water use (DWU), industrial water use (IWU) and service industry water use (SIWU), which in turn leads to an increase in sewage discharge. But wastewater reuse (WR) can reduce the amount of WS to a certain extent, so the resulting fluctuation interval of WS cannot be known. This situation does not provide strong support for water supply dispatching, because studies and experiments have found that water demand prediction can accurately predict only the trend of water demand and has difficulty predicting the fluctuation. Acute fluctuation of WS causes large errors between the amount of WS and the predicted water demand, yet water supply dispatching depends on water demand prediction to a large extent. If the amount of water diversion is greater than the amount of WS, the water level of the reservoirs will rise. Once there is large-scale rainfall, the reservoir may release surplus water. 
If the amount of water diversion is less than the amount of WS, the water level of the reservoirs will fall. According to the dispatching regulation of the reservoir, it is possible to increase the amount of water diversion in the next few days in order to raise the water level of the reservoir. If the amount of WS decreases and there is large-scale rainfall, the reservoir will release more surplus water. This will waste a lot of limited water resources and decrease the efficiency of WS. Therefore, it is necessary to grasp the causes and laws of water supply fluctuation so that water supply dispatching can be carried out more scientifically. In the attribution analysis of water supply fluctuation, the association analysis of WS is essential. Common association analysis methods include similarity analysis [4, 5], cluster analysis [6, 7], regression model [8-10], time series model [11-13], and artificial neural network (ANN) [14-16]. Similarity analysis and clustering analysis methods can effectively classify a variety of data, and these methods have been widely used in many fields [17, 18]. Regression models and time series models can generate multiple regression equations and autoregressive equations, and the coefficients of the equations can represent the association relationship and association degree between independent variables and dependent variable [19-21]. The ANN models can generate potential quantitative relationships between features, which can be transmitted by weights and biases. However, based on a large number of studies and experiments, for continuous data, it is found that there is serious multicollinearity [22, 23] between features. The regression models are prone to distortion or generate spurious regression equation [24], resulting in invalid association analysis. 
When an independent variable that has a strong monotone relationship with the dependent variable is added, the previously sensitive independent variable is likely to become less sensitive. Meanwhile, the analysis results of regression models, similarity analysis and clustering analysis are easily affected by monotone relationship [25]. If there is a strong monotone relationship between the features, the features can be easily judged to be similar or clustered into a category. The performance of ANN model depends on the data quality and input feature selection to a large extent. If the input features and output features of ANN model have a strong monotone relationship, the modeling effect of ANN will be good, and the input features can well simulate the changes of output features. The work in [16] used some multiple feature selection techniques, such as Pearson correlation and principal component analysis (PCA), but the Pearson correlation coefficient is an evaluation method of the linear relationship between two features. When the features do not follow a normal distribution or have more complex linear correlation relationship, the Pearson correlation coefficient is no longer valid. The PCA has mapped the original features, and we can only use these new features to simulate the change of the target feature, but we cannot know the association rules between the original features. In fact, the multicollinearity and monotone relationship often exist between continuous data, which will have a great impact on the effectiveness of the analysis results. More importantly, there is no means of validating the analysis results, nor can the value interval of features be obtained. If the value interval of features changes, we cannot know whether these features are still sensitive. These facts make these analysis results unreliable. 
In addition, through the above methods, we still cannot know whether features are directly or indirectly associated, nor can we know their association rules. Therefore, these analysis methods reveal huge limitations in the association analysis, and surprisingly little research has been done on valid association analysis between features and WS. At present, data mining methods [26-28] have gradually become a major breakthrough in knowledge discovery. The data mining methods can find the hidden information that the data cannot tell us, and obtain the previously unknown and valuable knowledge. Consequently, this study proposed a data mining method coupling discretization [29-31] and apriori algorithm to solve these problems in water supply association analysis. The kmeans clustering algorithm [32, 33] is used to carry out the discretization of continuous data to obtain the one-hot encoding that can be recognized by apriori, and the discretization can also avoid the influence of multicollinearity and monotone relationship. The apriori algorithm [34-36] is used to carry out the association analysis. A scientific method of water supply association analysis was proposed, the strong association rule (SAR) was established, and the value interval between features and confidence probability of the SAR were recognized. The water supply association analysis is helpful to understand the causes of water supply fluctuation and grasp the fluctuation interval of WS, which can provide strong support for urban water supply dispatching.

Materials and methods

Discretization

The apriori algorithm is a rule-based machine learning algorithm [37-39] that can effectively find association rules between features. However, the research data are continuous, which is inconsistent with the data structure that apriori requires: the algorithm can only identify one-hot encoding, in which only one bit is valid. Therefore, it is necessary to discretize the data and transform them into one-hot encoding. Data discretization divides the value range of the data into discrete intervals, and discrete data also avoid the influence of multicollinearity and monotone relationships. The quality of the discretization affects the association rules and value intervals generated by the apriori algorithm. Discretization methods are either supervised or unsupervised, depending on whether the data contain category information. Supervised discretization methods need to take category information into account, while unsupervised methods do not. If a supervised method is used, the category of the target data needs to be manually annotated. Therefore, unsupervised discretization methods are more widely used, and they are used in this study. They include equal width discretization, equal frequency discretization and clustering discretization. Fig 1 shows the discretization results of the three methods. The kmeans clustering algorithm is used in this study because, among current clustering algorithms, it is considered to perform well at a low calculation load and has strong self-adaptive ability.

Fig 1

Comparison of discretization methods.

0, 1 and 2 are the category labels.

As can be seen from Fig 1, the data are divided into three intervals after discretization, and the blue, orange and green dots represent the value distribution within the three intervals. The value range of WS is [10189, 18687] ×10⁴ m³. Equal width discretization divides the value range into intervals of the same width. Because the data are unevenly distributed, very few data fall in the first interval, which can harm the data mining process. Equal frequency discretization places the same amount of data in each interval. To keep the counts equal, identical values are very likely to be assigned to different intervals, which also harms the data mining process to a certain extent. Equal frequency discretization can also produce intervals of very different width. By contrast, the interval widths of kmeans clustering discretization differ little and the data are evenly distributed, so the kmeans algorithm performs better for data discretization and is the method finally applied in this study.
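
The three unsupervised discretization methods compared above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the toy sample and the quantile-based initialisation of kmeans are all assumptions made here for the example.

```python
from typing import List


def equal_width_labels(values: List[float], k: int) -> List[int]:
    """Label values by k intervals of identical width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum would otherwise fall into a (k+1)-th bin, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_labels(values: List[float], k: int) -> List[int]:
    """Label values so that every interval holds (almost) the same count."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(values)
    return labels


def kmeans_labels(values: List[float], k: int, iters: int = 100) -> List[int]:
    """One-dimensional k-means (Lloyd's algorithm); labels follow value order."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialisation: one start per quantile band, which keeps the
    # centres sorted and the run deterministic.
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: nearest centre for every value.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


# Hypothetical toy sample with three repeated values (12.0) and a gap.
sample = [10.0, 11.0, 12.0, 12.0, 12.0, 14.0, 15.0, 28.0, 29.0, 30.0]
print(equal_width_labels(sample, 3))      # [0, 0, 0, 0, 0, 0, 0, 2, 2, 2]
print(equal_frequency_labels(sample, 3))  # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
print(kmeans_labels(sample, 3))           # [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]
```

On this toy sample, equal width leaves the middle interval empty and equal frequency splits the three identical values 12.0 across two intervals, while kmeans keeps them together. The outcome depends on the data, so this only illustrates the failure modes described in the text.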

Apriori algorithm

Apriori adopts a layer-by-layer iterative search, and the whole process consists of splicing and pruning. Splicing constructs the candidate item sets, and the support degree of each item set is then calculated. If an item set does not meet the support degree threshold, it is pruned off, and the remaining item sets form the frequent item sets. The association rules are generated by crossing the frequent item sets, and rules that do not meet the confidence degree threshold are pruned off. The valid association rules are obtained after the lift degree validation. The algorithm iteratively generates and validates association rules until no more rules can be generated. The apriori algorithm uses a recursive method, and the detailed process (Eqs 1 to 4) is as follows.

Step 1: Initialize k = 1. The candidate set (CF) of the k-item frequent set is calculated, and the frequent set (SF) whose support degree exceeds the support degree threshold is selected.

Step 2: SF is crossed with itself to generate the candidate set (CR) of k-item association rules, and the association rules (R) whose confidence degree exceeds the confidence degree threshold are selected.

Step 3: If the (k+1)-item frequent set is empty, the k-item frequent set is taken as the final result and the iteration stops.

In Eqs 1 to 4, i is a data item, c is a candidate, sup and conf are the support degree and confidence degree, and the corresponding threshold symbols are the support degree threshold and the confidence degree threshold.

Support degree (Eqs 5 and 6) is equivalent to a probability: the ratio between the number of samples in which an event occurs and the total number of samples. If P(A) is small, few samples contain event A. If P(AB) is small, it only indicates that few samples contain both A and B; it does not mean that the association between A and B is weak. Event B may appear in every record where event A appears, which indicates that A and B are associated, and P(AB) is small simply because N is very large. In Eqs 5 and 6, A and B are events, P is the probability, and N is the number of samples.

Confidence degree (Eq 7) is equivalent to the conditional probability. If the confidence degree is high, B is likely to occur when A occurs; if it is low, B is unlikely to occur when A occurs. In Eq 7, conf is the confidence degree, A→B is the association rule, and P(B|A) is the conditional probability.

The lift degree (Eq 8) is used to validate the SAR and prune off the spurious SAR. From the properties of conditional probability, the lift degree is greater than or equal to 0. When the lift degree equals 1, P(AB) = P(A)P(B), so A and B are independent of each other. Therefore, even if the confidence degree is very high, a rule whose lift degree is less than or equal to 1 is a spurious SAR. When the lift degree is greater than 1, the SAR is a valid SAR. The higher the lift degree, the stronger the association degree of the SAR. In Eq 8, lift is the lift degree.

Another validation standard of the apriori algorithm is the leverage rate (Eq 9), which is a variant of the lift degree. When the lift degree is less than 1, the leverage rate is less than 0. When the lift degree is 1, the leverage rate is 0. When the lift degree is greater than 1, the leverage rate is greater than 0. The two standards differ only in how they are calculated, so the leverage rate is not calculated in this study. In Eq 9, leve is the leverage rate.
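
The splicing, pruning and validation steps described above can be sketched as follows. This is a minimal sketch, not the authors' code. It assumes the standard textbook definitions of the four measures, namely support P(AB), confidence P(AB)/P(A), lift P(AB)/(P(A)P(B)) and leverage P(AB) − P(A)P(B). It also assumes that each record is a set of one-hot items and that thresholds are applied with "greater than or equal to". The names `support`, `rule_metrics` and `mine_rules` are hypothetical.

```python
from itertools import chain


def support(records, items):
    """P(items): share of records that contain every item."""
    wanted = set(items)
    return sum(1 for r in records if wanted <= r) / len(records)


def rule_metrics(records, antecedent, consequent):
    """Support, confidence, lift and leverage of antecedent -> consequent."""
    p_a = support(records, antecedent)
    p_b = support(records, consequent)
    p_ab = support(records, set(antecedent) | set(consequent))
    conf = p_ab / p_a
    return {
        "support": p_ab,
        "confidence": conf,            # P(B|A)
        "lift": conf / p_b,            # P(AB) / (P(A) P(B))
        "leverage": p_ab - p_a * p_b,  # P(AB) - P(A) P(B)
    }


def mine_rules(records, min_sup=0.08, min_conf=0.5, max_antecedent=2):
    """Level-wise search for rules with up to max_antecedent antecedent items."""
    items = sorted(set(chain.from_iterable(records)))
    singles = [i for i in items if support(records, [i]) >= min_sup]
    frequent = [frozenset([i]) for i in singles]
    rules = []
    for _ in range(max_antecedent):
        # Splicing: grow every frequent item set by one frequent item.
        candidates = {f | {i} for f in frequent for i in singles if i not in f}
        # Pruning: keep only candidates that reach the support threshold.
        frequent = [c for c in candidates if support(records, c) >= min_sup]
        if not frequent:
            break
        for itemset in frequent:
            for consequent in sorted(itemset):
                antecedent = itemset - {consequent}
                metrics = rule_metrics(records, antecedent, [consequent])
                if metrics["confidence"] >= min_conf:
                    rules.append({"antecedent": tuple(sorted(antecedent)),
                                  "consequent": consequent, **metrics})
    return rules


# Hypothetical toy records: A and B co-occur in 3 of 10 records.
records = [{"A", "B"}] * 3 + [{"A"}, {"B"}] + [{"C"}] * 5
for rule in mine_rules(records):
    print(rule["antecedent"], "->", rule["consequent"], round(rule["lift"], 3))
```

In this sketch the lift validation is left to the caller: keeping only the rules with `lift > 1` corresponds to discarding the spurious SARs.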

Coupling method

Different discretization parameters generate different value intervals. Therefore, this study feeds different discretization parameters into kmeans in order to find, through a finite number of iterations, the optimal discretization parameter (D) that gives the best data mining result. The purpose is to obtain not only more SARs, but also valid SARs and detailed value intervals. Focusing only on the quantity of SARs may cost quality, and focusing only on quality may cost quantity. Therefore, the objective function (Eq 10) maximizes the average lift degree of the valid SARs. This objective avoids accidental results and makes the results of the iterative calculation more reliable. In Eq 10, n is the number of valid SARs. If the support degree threshold is set too high, the time needed to calculate the frequent item sets is reduced, but important frequent items hidden in the data are easily filtered out. Because the confidence degree is calculated after the support degree, the support degree threshold should be set as small as possible. If the confidence degree threshold is set too low, a large number of rules weakly associated with WS may be generated, which leads to a high calculation load and greatly increases the data mining time. Because the lift degree of each SAR is calculated afterwards, the confidence degree threshold does not have to be very high, which ensures that more rules are generated. Therefore, considering the calculation load and efficiency of the algorithm, and drawing on previous data mining experience, the support degree threshold is set to 0.08 and the confidence degree threshold is set to 0.5 (Eq 11). A SAR is an association rule that passes the support degree threshold and the confidence degree threshold, and a SAR that also passes the lift degree threshold is a valid SAR. Eq 11 also gives the lift degree threshold. 
Fig 2 shows the flow chart of the coupling method. D is initialized, and the discretization breakpoints and value intervals are generated to build a data structure that the apriori algorithm can recognize. The SARs are then obtained, the cross rules within the same category and across different categories are generated, and their confidence probabilities and value intervals are obtained. In addition, the method can recognize whether features are directly or indirectly associated. The methods and algorithms in this study were developed using Python 3.

Fig 2

The flow chart of coupling method.
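
The coupling loop can be sketched as below: for each candidate D, discretize every feature with kmeans, mine the rules, keep the valid SARs and score D by their average lift degree. This is a simplified sketch under stated assumptions, not the authors' program. It mines only one-item rules that point to the target, uses a lift degree threshold of 1, and runs on synthetic data. All function names are hypothetical.

```python
from statistics import mean


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm in one dimension; returns labels 0..k-1 in value order."""
    ordered = sorted(values)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return labels


def one_item_rules(table, target, min_sup, min_conf):
    """Rules {feature}a -> {target}b that pass the support and confidence thresholds."""
    goal = table[target]
    n = len(goal)
    rules = []
    for feature, labels in table.items():
        if feature == target:
            continue
        for a in sorted(set(labels)):
            for b in sorted(set(goal)):
                n_ab = sum(1 for x, y in zip(labels, goal) if x == a and y == b)
                sup, conf = n_ab / n, n_ab / labels.count(a)
                if sup >= min_sup and conf >= min_conf:
                    lift = conf / (goal.count(b) / n)
                    name = "{%s}%d->{%s}%d" % (feature, a + 1, target, b + 1)
                    rules.append({"rule": name, "support": sup,
                                  "confidence": conf, "lift": lift})
    return rules


def search_best_d(data, target, d_values, min_sup=0.08, min_conf=0.5, min_lift=1.0):
    """Return the D whose valid SARs have the highest average lift degree."""
    best = None
    for d in d_values:
        table = {name: kmeans_1d(column, d) for name, column in data.items()}
        sars = one_item_rules(table, target, min_sup, min_conf)
        valid = [r for r in sars if r["lift"] > min_lift]  # lift validation
        if not valid:
            continue
        score = mean(r["lift"] for r in valid)
        if best is None or score > best["score"]:
            best = {"d": d, "score": score, "n_valid": len(valid)}
    return best


# Synthetic, perfectly monotone toy data (not the Shenzhen records).
x = [float(v) for v in range(1, 31)]
data = {"X": x, "WS": [2 * v + 100 for v in x]}
print(search_best_d(data, "WS", d_values=[2, 3]))
```

On perfectly monotone toy data a larger D always raises the average lift, so the search simply picks the largest candidate. On real data the support threshold and the spread of the clusters limit how far D can grow.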

Study area and data description

Study area

Shenzhen, a sub-provincial city of Guangdong Province, is a special economic zone of China approved by the State Council, adjacent to Hong Kong. Affected by topography, the city has no large rivers, and the water supply reservoirs in Shenzhen are mainly medium reservoirs and small-scale reservoirs. Shenzhen has abundant rainfall, with an average annual rainfall of 1830 mm. But the water supply dispatching in Shenzhen mainly relies on water diversion, which accounts for more than 80 percent of the total water supply. At the same time, there is an acute fluctuation in water supply of Shenzhen. Therefore, it is essential to carry out association analysis of water supply in Shenzhen and understand the fluctuation cause and fluctuation interval of water supply.

Data description

The water use of Shenzhen consists of DWU, IWU, SIWU, ecological water use, agricultural water use and construction industry water use. The Pearson correlation coefficient evaluates the linear relationship between two variables. When the variables do not follow a normal distribution or have a relationship more complex than a linear one, the Pearson correlation coefficient is no longer valid. Spearman’s rank correlation coefficient (SRCC) (Eqs 12 and 13) [40] evaluates the monotone relationship between two variables. SRCC does not require prior knowledge, and its scope of application is wider than that of the Pearson correlation coefficient. Therefore, SRCC was applied in this study to calculate the correlation coefficients, and DWU, IWU and SIWU showed a stronger monotone relationship with WS than the other water use types. In Eqs 12 and 13, R(X) and R(Y) are the ranks of features X and Y, and n is the number of elements. In addition, Shenzhen is a fully urbanized city with an FP of more than 8 million. A large amount of water use also produces a large amount of wastewater, so the amount of wastewater reuse (WR) in Shenzhen is increasing. The migration of FP, the amount of R and the amount of WR have a great influence on WS. Therefore, six features, DWU, IWU, SIWU, FP, WR and R, were selected in this study. The data are monthly data from the Shenzhen Bureau of Water, the Shenzhen Bureau of Statistics, the Shenzhen Water Group and the Digital Water System from 2004 to 2019. The variable descriptions are presented in Table 1. According to the standard deviation, the fluctuation of WS is acute, and the amount of WR has increased markedly.

Table 1

Variable description.

Feature	Minimum	Maximum	Average value	Standard deviation
DWU/10⁴m³	4283.1	7107.69	5790.65	568.46
IWU/10⁴m³	2689.23	6066.46	4500.22	681.92
SIWU/10⁴m³	1663.95	5549.5	3491.74	914.66
FP/10⁴people	348.88	856.12	618.38	108.12
R/mm	30.07	398.86	158.43	110.38
WR/10⁴m³	0.33	1696.78	574.17	566.5
WS/10⁴m³	10189.05	18686.78	15721.55	1682.78
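
The difference between the Pearson coefficient and the SRCC used for feature selection can be illustrated with a short sketch. It assumes the usual definition of the SRCC as the Pearson coefficient of the ranks, with tied values sharing their average rank. Without ties this equals 1 − 6Σd²/(n(n² − 1)), where d = R(X) − R(Y). The function names and the toy data are hypothetical.

```python
def average_ranks(values):
    """Ranks 1..n in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """SRCC: the Pearson coefficient computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Monotone but non-linear toy relation: y = x ** 3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

For this monotone but non-linear relation the SRCC is 1 (up to rounding) while the Pearson coefficient stays below 1, which is the behaviour that motivates using the SRCC here.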

Results

Exploratory data analysis

The exploratory data analysis (EDA) [41] of the dataset is carried out first. EDA gives an intuitive understanding of the characteristics of the data and of the potential quantitative relationships between them, and it can be used to look for patterns and relationships. Among the various EDA methods, the scatter density plot matrix (Fig 3) is the most effective. It can describe a variety of complex relationships, such as the density distribution of a single feature and the density and correlation relationships between features. In order to draw the plot clearly, data normalization is performed (Eq 14). However, normalization is not performed during data discretization and data mining, in order to preserve the original potential relationships between the features. In Eq 14, V_min and V_max are the minimum and maximum values of the sample, and V_i is the value at time i.

Fig 3

Scatter density plot matrix.

In Fig 3, the upper triangle shows the scatter distribution plot between two features, the diagonal shows the density curve of a single feature, and the lower triangle shows the two-dimensional density plot between two features. The scatter distribution plot describes the correlation relationship between two features. According to the scatter distribution plot between FP and WS, WS increases when FP increases, which indicates a positive correlation between FP and WS. Similarly, FP is positively correlated with DWU and SIWU, and WS is positively correlated with DWU and SIWU. WS is also positively correlated with IWU in the interval [0, 0.5]. But the plots do not show how WR and R correlate with other features. The density curve presents the density distribution of a single feature. The horizontal axis represents the value range of the feature, and the vertical axis represents the density. WR and R have two wave peaks, and all other features have only one. A wave peak marks the position with the highest density in the value range. Take FP as an example. FP has only one wave peak, indicating that only one value interval in its value range has a high density. The position with the highest density corresponds to approximately 0.5 on the horizontal axis. The value range of FP is [348.88, 856.12] ×10⁴ people, so 0.5 on the horizontal axis corresponds to an original value of 602.5 ×10⁴ people. The two ends of the value range are the least dense. Similarly, WR and R show two wave peaks, indicating that these two features have two high-density intervals. The two-dimensional density plot describes the joint density of two features. As can be seen from the two-dimensional density plots, the density distribution of the features is not even but is denser in some intervals. Take the two-dimensional density plot between WS and FP as an example. 
The two-dimensional density plot is concentrated around the line y = x, which indicates a good positive correlation between the two features. Meanwhile, when FP is within the interval [0.3, 0.7] and WS is within the interval [0.5, 1], the color of the density plot is the darkest. This result indicates that when FP is within [0.3, 0.7], WS has its maximum density within [0.5, 1]. The two-dimensional density plot between WR and SIWU can be read in the same way. Since WR has two high-density intervals, the plot is divided into two regions. The lower part of the plot is darker, indicating that when WR is within the interval [0, 0.5], SIWU has its maximum density within [0.3, 0.7]. These results reveal a potential quantitative relationship between features. However, these relationships are still somewhat fuzzy, and the value intervals of the features are not clear enough, so more detailed rules and value intervals still need to be generated by a data mining algorithm.
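
The normalized axis values quoted above can be mapped back to original units with a short sketch. It assumes that the normalization used for plotting is ordinary min-max scaling, which is consistent with 0.5 on the FP axis corresponding to 602.5 ×10⁴ people. The function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Map a normalized value back to the original value range [lo, hi]."""
    return lo + scaled * (hi - lo)


# FP value range from Table 1, in units of 10^4 people.
fp_min, fp_max = 348.88, 856.12
print(round(denormalize(0.5, fp_min, fp_max), 2))                # 602.5
print([round(denormalize(s, fp_min, fp_max), 2) for s in (0.3, 0.7)])
```

The second line converts the normalized FP interval [0.3, 0.7] read from the density plot into original units in the same way.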

Experiment results

The calculation results of the coupling method show that the average lift degree of the valid SARs is maximum when D is equal to 3. The larger the value of D, the fewer the valid SARs and the lower the sensitivity between the features and WS. The smaller the value of D, the more the valid SARs and the higher the sensitivity between the features and WS. At the same time, if D is relatively small, the interval width is relatively large, and spurious SARs that do not meet the lift degree threshold can appear in the results. The results of D = 3 and 4 are presented in Table 2. When D is equal to 3, the number of valid SARs is 23, the objective function value is 2.71, and the features are more sensitive, but spurious SARs appear in the results. When D is equal to 4, the number of valid SARs is 20, the objective function value is 2.17, and the sensitivity of the features decreases, but there are no spurious SARs in the results.

Table 2

The result comparison of the different D.

D	Objective function value	Number of valid SAR	Whether there is spurious SAR	Feature sensitivity
3	2.71	23	Yes	Strong
4	2.17	20	No	Weak

In order to compare the data mining results of different D more clearly, the results of D = 3 and 4 are compared and analyzed in this study. Their discretization results are presented in Tables 3 and 4. The discretization breakpoints are generated within the valid value range of the continuous data, and different values of D produce different breakpoints. Compared with D = 3, D = 4 has four discretization intervals with smaller interval widths, so the value range is divided in more detail.

Table 3

The discretization results of D = 3.

Feature	Category 1	Category 2	Category 3
DWU/10⁴m³	(4283.1, 5261.45]	(5261.45, 6074.3]	(6074.3, 7107.69]
IWU/10⁴m³	(2689.23, 4154.58]	(4154.58, 4685.48]	(4685.48, 6066.46]
SIWU/10⁴m³	(1663.95, 2621.15]	(2621.15, 3958.37]	(3958.37, 5549.5]
FP/10⁴people	(348.88, 509.95]	(509.95, 676.51]	(676.51, 856.12]
R/mm	(30.07, 118.32]	(118.32, 179.92]	(179.92, 398.86]
WR/10⁴m³	(0.33, 491.92]	(491.92, 618.26]	(618.26, 1696.78]
WS/10⁴m³	(10189.05, 13878.35]	(13878.35, 16709.5]	(16709.5, 18686.78]

Table 4

The discretization results of D = 4.

Feature	Category 1	Category 2	Category 3	Category 4
DWU/10⁴m³	(4283.1, 5026.03]	(5026.03, 5750.3]	(5750.3, 6311.27]	(6311.27, 7107.69]
IWU/10⁴m³	(2689.23, 3685.71]	(3685.71, 4538.36]	(4538.36, 4768.16]	(4768.16, 6066.46]
SIWU/10⁴m³	(1663.95, 2468.94]	(2468.94, 3195.45]	(3195.45, 4590.63]	(4590.63, 5549.5]
FP/10⁴people	(348.88, 475.54]	(475.54, 601.22]	(601.22, 731.63]	(731.63, 856.12]
R/mm	(30.07, 94.29]	(94.29, 151.94]	(151.94, 207.23]	(207.23, 398.86]
WR/10⁴m³	(0.33, 494.99]	(494.99, 524.72]	(524.72, 632.22]	(632.22, 1696.78]
WS/10⁴m³	(10189.05, 12928.82]	(12928.82, 15712.46]	(15712.46, 17393.34]	(17393.34, 18686.78]
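
Given the breakpoints in Table 3, assigning a raw value to its category and building the one-hot encoding can be sketched as follows. The intervals are right-closed, (a, b], so a value equal to a breakpoint belongs to the lower category. Only the FP and WS breakpoints for D = 3 are copied here. The function names are hypothetical, and the item label format follows the notation used in the rule tables.

```python
from bisect import bisect_left

# Interior breakpoints for D = 3, copied from Table 3.
BREAKPOINTS_D3 = {
    "FP": [509.95, 676.51],     # 10^4 people
    "WS": [13878.35, 16709.5],  # 10^4 m^3
}


def category(feature, value, breakpoints=BREAKPOINTS_D3):
    """1-based category of value, using right-closed intervals (a, b]."""
    # bisect_left counts the breakpoints strictly below value.
    return bisect_left(breakpoints[feature], value) + 1


def one_hot_vector(feature, value, breakpoints=BREAKPOINTS_D3):
    """One-hot encoding: a single 1 at the position of the value's category."""
    n_categories = len(breakpoints[feature]) + 1
    hit = category(feature, value, breakpoints)
    return [1 if c == hit else 0 for c in range(1, n_categories + 1)]


def item_label(feature, value, breakpoints=BREAKPOINTS_D3):
    """Item label in the notation of the rule tables, for example '{WS}2'."""
    return "{%s}%d" % (feature, category(feature, value, breakpoints))


# The average WS from Table 1 falls into the second category.
print(item_label("WS", 15721.55), one_hot_vector("WS", 15721.55))
```

A record is then the set of item labels of all its features, which is the form consumed by an apriori-style search.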

Experiments showed that when a frequent item set contains more than two items, there is almost certainly a strong association degree between the features and WS. This is because the feature strongly associated with WS dominates the SAR, which is equivalent to giving this feature a large weight. If a new feature that has a strong association degree with WS is added to a two-item SAR, the three-item SAR will undoubtedly have a strong association degree with WS. Even if a feature that is completely uncorrelated with WS is added to the two-item SAR, the three-item SAR still has a strong association degree with WS because of the feature with the large weight. Therefore, this study focuses only on the one-item SARs and two-item SARs, which also reduces the calculation load of data mining. The data mining results of D = 3 are shown in Tables 5 and 6. The SARs are sorted by lift degree from highest to lowest. The letters inside the braces represent the feature, and the number outside the braces represents the category to which the feature belongs, that is, a detailed value range of the feature. Take the first rule in Table 5 as an example: FP and WS are directly associated. When FP is in category 1 (FP is within the interval (348.88, 509.95] ×10⁴ people) and WS is in category 1 (WS is within the interval (10189.05, 13878.35] ×10⁴ m³), the support degree, confidence degree and lift degree of the SAR are 0.13, 0.71 and 4.73 respectively, showing the strongest association degree among the one-item SARs. As can be seen from rules 11–14, the lift degree of these SARs is less than 1, revealing that they are spurious SARs and should be discarded.

Table 5

The one-item SAR of D = 3.

Number	SAR	Support degree	Confidence degree	Lift degree
1	{FP}1→{WS}1	0.13	0.71	4.73
2	{DWU}1→{WS}1	0.12	0.66	4.35
3	{SIWU}1→{WS}1	0.09	0.58	3.84
4	{SIWU}3→{WS}3	0.26	0.86	2.81
5	{FP}3→{WS}3	0.22	0.81	2.64
6	{DWU}3→{WS}3	0.21	0.68	2.21
7	{FP}2→{WS}2	0.44	0.81	1.49
8	{SIWU}2→{WS}2	0.43	0.81	1.49
9	{DWU}2→{WS}2	0.38	0.74	1.38
10	{WR}3→{WS}2	0.27	0.62	1.04
11	{R}3→{DWU}2	0.28	0.53	0.94
12	{R}3→{FP}2	0.29	0.56	0.92
13	{R}3→{WS}2	0.28	0.53	0.88
14	{R}3→{SIWU}2	0.26	0.50	0.83

Table 6

The two-item SAR of D = 3.

Number	SAR	Support degree	Confidence degree	Lift degree
1	{FP}1 and {DWU}1→{WS}1	0.11	0.91	6.04
2	{FP}1 and {IWU}1→{WS}1	0.10	0.90	5.99
3	{FP}1 and {SIWU}1→{WS}1	0.09	0.82	5.42
4	{FP}3 and {SIWU}3→{WS}3	0.22	0.98	3.18
5	{FP}3 and {DWU}3→{WS}3	0.18	0.95	3.08
6	{R}1 and {SIWU}3→{WS}3	0.08	0.82	2.34
7	{FP}2 and {IWU}2→{WS}2	0.12	0.96	1.77
8	{FP}2 and {SIWU}2→{WS}2	0.35	0.85	1.57
9	{FP}2 and {DWU}2→{WS}2	0.30	0.81	1.50
10	{R}3 and {SIWU}2→{WS}2	0.23	0.87	1.43
11	{WR}3 and {DWU}2→{WS}2	0.19	0.85	1.43
12	{WR}3 and {SIWU}2→{WS}2	0.21	0.82	1.39
13	{R}3 and {DWU}2→{WS}2	0.20	0.71	1.17
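
The lift validation and the objective function value for D = 3 can be checked directly from the lift degrees listed in Tables 5 and 6. The sketch below assumes that a SAR is valid when its lift degree is greater than 1 and that the objective function value is the average lift degree of the valid SARs. The function name is hypothetical.

```python
from statistics import mean

# Lift degrees copied from Table 5 (one-item SARs) and Table 6 (two-item SARs).
TABLE5_LIFT = [4.73, 4.35, 3.84, 2.81, 2.64, 2.21, 1.49,
               1.49, 1.38, 1.04, 0.94, 0.92, 0.88, 0.83]
TABLE6_LIFT = [6.04, 5.99, 5.42, 3.18, 3.08, 2.34, 1.77,
               1.57, 1.50, 1.43, 1.43, 1.39, 1.17]


def split_by_lift(lifts, min_lift=1.0):
    """Separate valid SARs (lift above the threshold) from spurious ones."""
    valid = [x for x in lifts if x > min_lift]
    spurious = [x for x in lifts if x <= min_lift]
    return valid, spurious


valid, spurious = split_by_lift(TABLE5_LIFT + TABLE6_LIFT)
print(len(valid), len(spurious), round(mean(valid), 2))  # 23 4 2.71
```

The count of 23 valid SARs and the average lift degree of 2.71 agree with the D = 3 row of Table 2, and the four spurious SARs are rules 11–14 of Table 5.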

In the SARs of Table 5, FP, DWU, SIWU and WS belong to the same category. In the first category (rules 1–3), FP has the strongest association degree with WS, followed by DWU and SIWU. This indicates that when the features belong to the first category, the influence of FP on WS is higher than that of DWU and SIWU. In the second category (rules 7–9), the association degree between SIWU and WS exceeds that between DWU and WS. FP is still ranked first (its lift degree of 1.49 equals that of SIWU, with slightly higher support), revealing that the influence of FP on WS is at least as high as that of SIWU and DWU. In the third category (rules 4–6), the association degree between SIWU and WS exceeds that between FP and WS, and the influence of the features on WS from high to low is SIWU, FP and DWU. These results reveal that the association degree between a feature and WS is affected by the category to which the feature belongs. As the category increases, SIWU becomes more sensitive to WS relative to the other features: the higher the category to which SIWU belongs, the higher its association degree with WS ranks among the features. For FP, the SAR between FP and WS has the strongest association degree when FP belongs to the first and second categories. Specifically, WS is most easily affected by FP when WS is within the interval (10189.05, 16709.5] 10^4 m^3, and most easily affected by SIWU when WS is within the interval (16709.5, 18686.78] 10^4 m^3. Although the association degree between SIWU and WS in the third category exceeds that between FP and WS, the gap is not large (lift degrees of 2.81 and 2.64). This result indicates that FP has a large influence on WS over the whole value range of WS. Similarly, DWU has the largest influence on WS when DWU is within the interval (4283.1, 5261.45] 10^4 m^3. Overall, the larger FP, SIWU and DWU are, the larger WS is.
There is a positive association relationship between FP, SIWU, DWU and WS. The relationships between features found with the scatter density plot matrix are consistent with the data mining results. The association degree of a feature is not constant; it differs between categories. Although the SARs between R and DWU, FP, WS and SIWU pass the support degree threshold and the confidence degree threshold, their lift degrees are all less than 1, indicating that they are all spurious SARs (Table 5, rules 11–14). R therefore has no valid one-item SAR. In contrast, the SAR between WR and WS (rule 10) meets the lift degree threshold, so it is a valid SAR. However, its association degree is the lowest of the valid SARs, and WR and WS do not belong to the same category: WR in category 3 is associated with WS in category 2. This suggests that the more WR, the less WS, so there is a negative association relationship between them. According to Table 6, eight SARs include FP, which reveals that FP also has a strong association degree with WS in the two-item SARs. The lift degree of the first SAR in Table 5 is 4.73. In rules 1–3 of Table 6, all the features belong to the first category, and the lift degrees of the three SARs are all greater than 4.73. This reveals that the two-item SARs that include FP have a higher lift degree than the one-item SAR that includes only FP. Rules 1–3 in Table 6 indicate that, in the first category, FP is more likely to affect WS through DWU, IWU and SIWU, and is most likely to affect WS through DWU. Similar results are found in the second and third categories. In the second category, the lift degree of the SAR between FP and WS in Table 5 is 1.49, and the lift degrees of rules 7–9 in Table 6 are all greater than 1.49. FP still affects WS through DWU, IWU and SIWU, but it is most likely to affect WS through IWU.
In the third category, the lift degree of the SAR between SIWU and WS in Table 5 is 2.81, and the lift degrees of rules 4 and 5 in Table 6 are greater than 2.81, so FP is most likely to affect WS through SIWU. However, no SAR between FP, IWU and WS appears in Table 6 for the third category, which indicates that when WS belongs to the third category, the impact of FP on WS has little association with IWU. Meanwhile, the value interval of WS can be inferred from the confidence degree of an SAR. For example, according to rule 4 in Table 6, when FP is within (676.51, 856.12] 10^4 people and SIWU is within (3958.37, 5549.5] 10^4 m^3, the probability that WS is within (16709.5, 18686.78] 10^4 m^3 is 0.98. Although there is no valid one-item SAR between R and WS, rules 6, 10 and 13 in Table 6 show that R is indirectly associated with WS through SIWU and DWU. R and WS do not belong to the same category in these rules, so there is a negative association relationship between them: the less R, the more WS, and the more R, the less WS. When R belongs to the first category, it affects WS through SIWU. When R belongs to the third category, it affects WS through SIWU and DWU, and is more likely to affect WS through SIWU. Similarly, WR affects WS through DWU and SIWU, and is more likely to affect WS through DWU. Tables 7 and 8 show the data mining results for D = 4. As with D = 3, the FP, SIWU, DWU and WS in the SARs belong to the same category, while R, WR and WS belong to different categories. There is a positive association relationship between FP, SIWU, DWU and WS, and a negative association relationship between WR, R and WS. However, although the maximum lift degrees in Table 5 and Table 7 differ little, there is a large gap between the maximum lift degree in Table 6 and that in Table 8.
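The comparison between one-item and two-item lift degrees used above to judge indirect association can be sketched as follows. The lift values are copied from Tables 5 and 6 (D = 3). The choice of baseline (the largest one-item lift among the antecedents of a two-item rule) and the function names are assumptions made for this illustration and are not taken from the authors' code.

```python
from typing import Dict, List, Tuple

Item = Tuple[str, int]  # (feature name, category)

# One-item lift degrees from Table 5 (valid rules with matching categories).
ONE_ITEM_LIFT: Dict[Item, float] = {
    ("FP", 1): 4.73, ("DWU", 1): 4.35, ("SIWU", 1): 3.84,
    ("SIWU", 3): 2.81, ("FP", 3): 2.64, ("DWU", 3): 2.21,
    ("FP", 2): 1.49, ("SIWU", 2): 1.49, ("DWU", 2): 1.38,
}

# Two-item rules from Table 6 that include FP: (antecedents, WS category, lift).
TWO_ITEM_FP_RULES: List[Tuple[Tuple[Item, Item], int, float]] = [
    ((("FP", 1), ("DWU", 1)), 1, 6.04),
    ((("FP", 1), ("IWU", 1)), 1, 5.99),
    ((("FP", 1), ("SIWU", 1)), 1, 5.42),
    ((("FP", 3), ("SIWU", 3)), 3, 3.18),
    ((("FP", 3), ("DWU", 3)), 3, 3.08),
    ((("FP", 2), ("IWU", 2)), 2, 1.77),
    ((("FP", 2), ("SIWU", 2)), 2, 1.57),
    ((("FP", 2), ("DWU", 2)), 2, 1.50),
]


def lift_gain(rule: Tuple[Tuple[Item, Item], int, float]) -> float:
    """Lift of the two-item rule minus the best one-item lift of its antecedents."""
    items, _, lift = rule
    baseline = max(ONE_ITEM_LIFT[i] for i in items if i in ONE_ITEM_LIFT)
    return lift - baseline


def best_partner_by_category(rules) -> Dict[int, str]:
    """For each WS category, the feature whose pairing with FP gives the highest lift."""
    best: Dict[int, Tuple[str, float]] = {}
    for items, category, lift in rules:
        partner = next(name for name, _ in items if name != "FP")
        if category not in best or lift > best[category][1]:
            best[category] = (partner, lift)
    return {category: partner for category, (partner, _) in best.items()}


for rule in TWO_ITEM_FP_RULES:
    print(rule[0], "gain over one-item baseline:", round(lift_gain(rule), 2))
print(best_partner_by_category(TWO_ITEM_FP_RULES))
```

Under this baseline, every two-item rule that includes FP has a positive gain, and the best partner per category matches the reading in the text: DWU in the first category, IWU in the second and SIWU in the third.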

Table 7

The one-item SAR of D = 4.

Number	SAR	Support degree	Confidence degree	Lift degree
1	{SIWU}4→{WS}4	0.09	0.69	5.11
2	{FP}4→{WS}4	0.09	0.53	3.91
3	{FP}2→{WS}2	0.29	0.79	2.29
4	{SIWU}2→{WS}2	0.20	0.72	2.10
5	{DWU}2→{WS}2	0.23	0.68	1.97
6	{FP}3→{WS}3	0.30	0.78	1.78
7	{SIWU}3→{WS}3	0.33	0.74	1.69
8	{DWU}3→{WS}3	0.22	0.63	1.45
9	{WR}4→{WS}3	0.21	0.53	1.20

Table 8

The two-item SAR of D = 4.

Number	SAR	Support degree	Confidence degree	Lift degree
1	{FP}2 and {IWU}2→{WS}2	0.11	0.95	2.78
2	{FP}2 and {DWU}2→{WS}2	0.19	0.93	2.69
3	{FP}2 and {SIWU}2→{WS}2	0.18	0.78	2.26
4	{R}1 and {SIWU}3→{WS}3	0.17	0.86	1.98
5	{FP}3 and {SIWU}3→{WS}3	0.24	0.84	1.91
6	{FP}3 and {DWU}3→{WS}3	0.17	0.80	1.83
7	{R}1 and {DWU}3→{WS}3	0.10	0.77	1.76
8	{R}4 and {SIWU}3→{WS}3	0.14	0.68	1.54
9	{R}4 and {DWU}3→{WS}3	0.11	0.56	1.29
10	{WR}4 and {SIWU}3→{WS}3	0.17	0.87	1.98
11	{WR}4 and {DWU}3→{WS}3	0.10	0.83	1.90

As can be seen from Table 7, in the second and third categories the order of the features' SARs remains unchanged (FP, SIWU, DWU), indicating that the sensitivity of the features to WS decreases. In addition, when the features belong to the fourth category, there is no valid SAR between DWU and WS. This result indicates that DWU has little influence on WS when WS is within (17393.34, 18686.78] 10^4 m^3. Rules 1–2 indicate that in this interval WS is mainly affected by SIWU and FP, and that the association degree between SIWU and WS is the strongest. There is a valid SAR between WR and WS, and its association degree is the weakest. No spurious SAR appears. The results in Table 8 are similar to those in Table 6. In the second category, FP affects WS through IWU, DWU and SIWU, and is most likely to affect WS through IWU. In the third category, FP is most likely to affect WS through SIWU. Similarly, WR and R affect WS through SIWU and DWU, and are more likely to affect WS through SIWU. Meanwhile, according to rules 4, 7, 8 and 9, when R belongs to the first category, WS belongs to the third category; but when R belongs to the fourth category, WS still belongs to the third category, not the second. This result shows that although R can affect WS through SIWU and DWU, it has little influence on water supply fluctuation. Finally, the distribution plots of the valid SARs are shown in Figs 4–7, where the color bars represent the association degree of the SARs.

Fig 4

The distribution plot of one-item valid SARs of D = 3.

Fig 7

The distribution plot of two-item valid SARs of D = 4.

The above comparative analysis shows that the results for each value of D have their own advantages. The average lift degree of the valid SARs for D = 3 is higher than that for D = 4. Meanwhile, when D = 3, the number of valid SARs is larger and the sensitivity of the features is higher. Among the two-item SARs, the maximum lift degree for D = 4 is significantly lower than that for D = 3. Although the results for D = 3 contain spurious SARs, these can easily be filtered out by the lift degree threshold, so the optimal value of D is 3.
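The comparison between D = 3 and D = 4 can be reproduced from the published tables with a short sketch. The lift degrees below are copied from Tables 5–8; the helper name `valid` and the pooling of one-item and two-item SARs into a single average are assumptions made for this illustration, since the text does not state how the average was formed.

```python
from statistics import mean
from typing import List

# Lift degrees copied from Tables 5 and 6 (D = 3) and Tables 7 and 8 (D = 4).
D3_ONE_ITEM = [4.73, 4.35, 3.84, 2.81, 2.64, 2.21, 1.49, 1.49, 1.38, 1.04,
               0.94, 0.92, 0.88, 0.83]
D3_TWO_ITEM = [6.04, 5.99, 5.42, 3.18, 3.08, 2.34, 1.77, 1.57, 1.50, 1.43,
               1.43, 1.39, 1.17]
D4_ONE_ITEM = [5.11, 3.91, 2.29, 2.10, 1.97, 1.78, 1.69, 1.45, 1.20]
D4_TWO_ITEM = [2.78, 2.69, 2.26, 1.98, 1.91, 1.83, 1.76, 1.54, 1.29, 1.98, 1.90]


def valid(lifts: List[float], threshold: float = 1.0) -> List[float]:
    """Keep only rules whose lift degree exceeds the threshold (spurious rules dropped)."""
    return [x for x in lifts if x > threshold]


d3_valid = valid(D3_ONE_ITEM) + valid(D3_TWO_ITEM)
d4_valid = valid(D4_ONE_ITEM) + valid(D4_TWO_ITEM)

print("D = 3:", len(d3_valid), "valid SARs, mean lift", round(mean(d3_valid), 2),
      ", max two-item lift", max(D3_TWO_ITEM))
print("D = 4:", len(d4_valid), "valid SARs, mean lift", round(mean(d4_valid), 2),
      ", max two-item lift", max(D4_TWO_ITEM))
```

With the pooled average, D = 3 gives more valid SARs, a higher mean lift degree and a higher maximum two-item lift degree than D = 4, which is consistent with the statements above.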

Discussion

In this study, the most critical parameters are the support degree threshold and the confidence degree threshold, and this section presents our experience in setting them. The lift degree threshold is also important, but its definition already fixes the threshold (SARs with a lift degree of less than 1 are spurious), so it does not need to be set manually. As mentioned in sections 2.2 and 2.3, a small P(AB) does not mean that the association degree between A and B is weak, because the total number of samples is very large. Therefore, the support degree threshold should be set as small as possible to prevent potential SARs from being filtered out. Meanwhile, the support degree threshold and the confidence degree threshold affect the computational load and efficiency of the algorithm, so the confidence degree threshold should not be set too small. To compare the data mining results obtained with different thresholds, the confidence degree threshold is set to 0.1 while the support degree threshold remains unchanged. The results are presented in Tables 9 and 10; SARs that already appear in the results of section 4.2 are not displayed again.
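The effect of the two thresholds can be illustrated with a simplified rule miner. This is a brute-force sketch with support-based pruning, not the apriori implementation used in the study; the records, the threshold values and the function name `mine_rules` are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Rule = Tuple[Tuple[str, ...], str, float, float, float]  # antecedent, target, support, confidence, lift

# Hypothetical discretized records (one category item per feature).
RECORDS: List[FrozenSet[str]] = [frozenset(r) for r in (
    {"FP1", "R3", "WS1"}, {"FP1", "R3", "WS1"}, {"FP1", "R1", "WS2"},
    {"FP2", "R3", "WS2"}, {"FP2", "R1", "WS2"}, {"FP2", "R3", "WS2"},
    {"FP2", "R1", "WS1"}, {"FP3", "R1", "WS3"}, {"FP3", "R3", "WS3"},
    {"FP3", "R1", "WS2"},
)]


def mine_rules(records: List[FrozenSet[str]], target_prefix: str,
               min_support: float, min_confidence: float,
               max_antecedent: int = 2) -> List[Rule]:
    """Enumerate rules feature(s) -> target that pass both thresholds."""
    n = len(records)

    def support(items: FrozenSet[str]) -> float:
        return sum(1 for r in records if items <= r) / n

    all_items = sorted(set().union(*records))
    targets = [i for i in all_items if i.startswith(target_prefix)]
    features = [i for i in all_items if not i.startswith(target_prefix)]

    rules: List[Rule] = []
    for k in range(1, max_antecedent + 1):
        for antecedent in combinations(features, k):
            a = frozenset(antecedent)
            s_a = support(a)
            if s_a < min_support:        # an infrequent antecedent cannot yield a frequent rule
                continue
            for target in targets:
                s_ab = support(a | {target})
                if s_ab < min_support:
                    continue
                confidence = s_ab / s_a
                if confidence < min_confidence:
                    continue
                lift = confidence / support(frozenset({target}))
                rules.append((antecedent, target, s_ab, confidence, lift))
    return sorted(rules, key=lambda r: r[4], reverse=True)


strict = mine_rules(RECORDS, "WS", min_support=0.1, min_confidence=0.5)
loose = mine_rules(RECORDS, "WS", min_support=0.1, min_confidence=0.1)
print(len(strict), "rules at confidence >= 0.5;", len(loose), "rules at confidence >= 0.1")
extra = [r for r in loose if r not in strict]
print("extra rules with lift > 1:", sum(1 for r in extra if r[4] > 1.0), "of", len(extra))
```

Lowering the confidence threshold can only add rules, so the looser run returns a superset of the stricter one; whether the added rules are valid is still decided by the lift degree.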

Table 9

The SAR difference of different confidence degree thresholds in D = 3.

Number	SAR	Support degree	Confidence degree	Lift degree
1	{IWU}2→{WS}2	0.15	0.49	0.91
2	{IWU}1→{WS}1	0.12	0.42	2.77
3	{IWU}3→{WS}3	0.12	0.29	0.96
4	{R}1→{WS}3	0.10	0.18	0.50
5	{R}1 and {WR}3→{WS}3	0.10	0.22	0.62
6	{R}3 and {IWU}2→{WS}2	0.09	0.46	0.76

Table 10

The SAR difference of different confidence degree thresholds in D = 4.

Number	SAR	Support degree	Confidence degree	Lift degree
1	{DWU}4→{WS}4	0.09	0.44	3.22
2	{IWU}2→{WS}2	0.16	0.38	1.09
3	{R}4→{WS}3	0.20	0.46	1.06
4	{R}1→{WS}3	0.21	0.43	0.97

Of the six SARs in Table 9, only one is valid, with a lift degree of 2.77. The lift degrees of rules 1–3 in Table 5 are all higher than 2.77, indicating that the association degree between IWU and WS in the first category is not very strong. Therefore, WS is more likely to be affected by FP, DWU and SIWU. Similarly, in Table 10, the lift degree of the first SAR is less than that of rules 1 and 2 in Table 7, and the lift degree of the second SAR is less than that of rules 3–5 in Table 7. Although the third SAR does not appear in Table 7, its lift degree is close to 1, indicating that R and WS are almost independent of each other. Therefore, a smaller confidence degree threshold has almost no influence on the results of this study, but it does affect the computational load and efficiency of the algorithm to some extent. The second SAR in Table 7 has a confidence degree of only 0.53 but a lift degree of 3.91. To preserve SARs with such a high lift degree, the confidence degree threshold cannot be increased. The support degree threshold set in this study is already relatively small, and setting it to an even smaller value would only generate more frequent itemsets. Because the association rules still need to be filtered by the confidence degree threshold, the data mining results for smaller support degree thresholds are not presented in this study. The value of D is inversely related to the width of the value interval. The smaller the width of the value interval, the easier it is to determine the fluctuation interval of WS accurately. Reducing the width of the value interval requires increasing D. However, the larger the value of D, the fewer the valid SARs. At the same time, the maximum lift degree of the two-item SARs and the feature sensitivity decrease. This is therefore a classic zero-sum trade-off, and we have been working to find the optimal value of D within it. Other analysis methods, such as those in [42, 43], can only recognize sensitive features; they do not reveal how the features are associated with each other or whether their association relationship is direct or indirect. In addition, the features recognized by these methods have not been validated, and it is not known whether these sensitive features remain valid when the value interval of the features changes. The analysis results of these methods are therefore less reliable, and the method in this study is superior and more reliable.
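The relation between D and the interval width can be sketched with a small one-dimensional k-means discretization. The pure-Python clustering below, its quantile-based initialization and the evenly spaced synthetic values are assumptions made for illustration; they stand in for the kmeans step of the method and are not the authors' implementation or data.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int, iters: int = 100) -> List[List[float]]:
    """Cluster 1-D values into k groups with Lloyd's algorithm (deterministic init)."""
    xs = sorted(values)
    # Assumed initialization: spread the initial centres over the quantiles of the data.
    centres = [xs[int((i + 0.5) * len(xs) / k)] for i in range(k)]
    clusters: List[List[float]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centres[c]))
            clusters[nearest].append(x)
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:   # assignments are stable, stop iterating
            break
        centres = new_centres
    return clusters


def intervals(clusters: List[List[float]]) -> List[Tuple[float, float]]:
    """Value interval (min, max) covered by each non-empty cluster."""
    return [(min(c), max(c)) for c in clusters if c]


def mean_width(clusters: List[List[float]]) -> float:
    spans = intervals(clusters)
    return sum(hi - lo for lo, hi in spans) / len(spans)


# Synthetic water-supply-like values, evenly spaced for reproducibility.
values = [10000.0 + 50.0 * i for i in range(180)]

for d in (3, 4):
    groups = kmeans_1d(values, d)
    print("D =", d, "mean interval width:", mean_width(groups),
          "smallest category size:", min(len(g) for g in groups))
```

A larger D narrows the value intervals but also leaves fewer samples in each category, which is one way to see why the number of valid SARs tends to fall as D increases.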

Conclusions

The purpose of this study is to carry out association analysis between features and WS through data mining, in order to understand the cause of water supply fluctuation and the fluctuation interval of water supply, realize attribution analysis of water supply fluctuation, and provide strong support for water supply dispatching. To improve the reliability of the analysis, a data mining method coupling kmeans clustering discretization and the apriori algorithm was proposed. Data discretization avoids the influence of multicollinearity and monotone relationships during the analysis. The scatter density plot matrix is used to discover correlations in the data intuitively and to explore the density distribution of each single feature and the two-dimensional density distribution between pairs of features. The kmeans clustering algorithm is applied to discretize the continuous data, and the apriori algorithm is used to carry out the association analysis. The method can not only obtain validated SARs and their association degree, but also give the value interval of the features. The results also show that the association relationship and association degree of an SAR are not constant, but are closely associated with the value interval of the features. The method in this study is a novel method for association analysis that is more valid and reliable than current analysis methods. In addition, the study provides guidance for avoiding the influence of multicollinearity and monotone relationships on analysis results in data analysis. At present, the zero-sum problem of D has not been solved, and we are working on it. In particular, the number of SARs and the sensitivity of features should not be reduced when D is increased. In the future, this research could be extended widely.
The next step of the research work is to continue to design and develop discretization methods and analysis algorithms to compare their results. We can collect more data for association analysis to improve performance of algorithm, and use different languages instead of Python to discover the advantages of different languages in data mining. We will also use the methods of this study in other cities and compare the analysis results with those in Shenzhen. 9 Jun 2021 PONE-D-21-16186 The water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm PLOS ONE Dear Dr. Sang, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jul 24 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. 
If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Zaher Mundher Yaseen Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section 3. We note that Figure 1 in your submission contain [map/satellite] images which may be copyrighted. 
All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright. We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission: a. You may seek permission from the original copyright holder of Figure(s) [#] to publish the content specifically under the CC BY 4.0 license. We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text: “I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.” Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission. In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].” b. 
If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only. The following resources for replacing copyrighted map figures may be helpful: USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/ The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/ Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/ Landsat: http://landsat.visibleearth.nasa.gov/ USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/# Natural Earth (public domain): http://www.naturalearthdata.com/ [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? 
Reviewer #1: Yes Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: 1. The manuscript presents the water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm, which is interesting. It is relevant and within the scope of the journal. 2. However, the manuscript, in its present form, contains several weaknesses. Appropriate revisions to the following points should be undertaken in order to justify recommendation for publication. 3. 
For readers to quickly catch your contribution, it would be better to highlight major difficulties and challenges, and your original achievements to overcome them, in a clearer way in abstract and introduction. 4. p.1 - a data mining coupling method is adopted for water supply attribution analysis. What are other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? The authors should provide more details on this. 5. p.1 - kmeans clustering discretization is adopted to solve the optimal discretization degree to avoid multicollinearity problem. What are the other feasible alternatives? What are the advantages of adopting this soft computing technique over others in this case? How will this affect the results? More details should be furnished. 6. p.1 - apriori algorithm is adopted to solve the optimal discretization degree to obtain association rule and association degree. What are the other feasible alternatives? What are the advantages of adopting this algorithm over others in this case? How will this affect the results? More details should be furnished. 7. p.2 - Shenzhen is adopted as the case study. What are other feasible alternatives? What are the advantages of adopting this case study over others in this case? How will this affect the results? The authors should provide more details on this. 8. p.3 - historical records of 2004 to 2019 are taken. Why are more recent data not included in the study? Is there any difficulty in obtaining more recent data? Are there any changes to situation in recent years? What are its effects on the result? 9. p.3 - the discretization results of three methods as shown in Fig. 2 are adopted in this study What are the other feasible alternatives? What are the advantages of adopting these methods over others in this case? How will this affect the results? More details should be furnished. 10. p.5 - Eq. 10 is adopted as the objective function. 
What are other feasible alternatives? What are the advantages of adopting this function over others in this case? How will this affect the results? The authors should provide more details on this. 11. p.5 - Eqs. 11 and 12 are adopted as the minimum support degree thresholds. What are the other feasible alternatives? What are the advantages of adopting these thresholds over others in this case? How will this affect the results? More details should be furnished. 12. p.6 - six features are adopted in the study. What are the other feasible alternatives? What are the advantages of adopting these features over others in this case? How will this affect the results? More details should be furnished. 13. p.6 - spearman's rank correlation coefficient is adopted to select the features. What are the other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? More details should be furnished. 14. p.6 - a scatter density diagram matrix is adopted to guide subsequent analysis. What are the other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? More details should be furnished. 15. p.8 - “…Compared with D = 4, D = 3 is considered to be better in this study, because the.…” More justification should be furnished on this issue. 16. p.8 - SAR with one-item set and two-item set are adopted in the experiments. What are the other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? More details should be furnished. 17. Some key parameters are not mentioned. The rationale on the choice of the particular set of parameters should be explained with more details. Have the authors experimented with other sets of values? What are the sensitivities of these parameters on the results? 18. Some assumptions are stated in various sections. 
Justifications should be provided on these assumptions. Evaluation on how they will affect the results should be made. 19. The discussion section in the present form is relatively weak and should be strengthened with more details and justifications. 20. Moreover, the manuscript could be substantially improved by relying and citing more on recent literatures about contemporary real-life case studies of modelling and/or optimization techniques in water distribution systems such as the followings. Discussions about result comparison and/or incorporation of those concepts in your works are encouraged: � Zheng FF, et al. “Improved Understanding on the Searching Behavior of NSGA-II Operators Using Run-Time Measure Metrics with Application to Water Distribution System Design Problems,” Water Resources Management 31 (4): 1121-1138 2017. � Shende S, et al. “Design of water distribution systems using an intelligent simple benchmarking algorithm with respect to cost optimization and computational efficiency,” Water Supply 19 (7): 1892-1898 2019. � Sedki A, et al. “Hybrid particle swarm optimization and differential evolution for optimal design of water distribution systems,” Advanced Engineering Informatics 26 (3): 582-591 2012. � Oyebode O, et al. “Evolutionary modelling of municipal water demand with multiple feature selection techniques,” Journal of Water Supply Research and Technology-AQUA, 68 (4): 264-281 2019 21. Some inconsistencies and minor errors that needed attention are: � Replace “…So Compared with D = 4, D = 3 is considered to be…” with “…So compared with D = 4, D = 3 is considered to be…” in p.8 22. In the conclusion section, the limitations of this study and suggested improvements of this work should be highlighted. 
Reviewer #2: The manuscript entitled “The water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm” looks an applied research based mainly on the kmeans clustering and apriori algorithm that had been applied on Shenzhen, a city located to south of China. It was well structured and written in good way. However I recommend a major revision to address the following points: Abstract The research abstract seems almost vague as if it is a fragmented part of the research. The abstract should be reformulated to be easy to understand and expressive of the overall research content. Introduction I think the introduction was written in a format that could be described as good, but it still did not live up to the preferences. It is better to divide the introduction to include the scientific background of the main topic and the methods of work used in the research. After that, the research problem should be stated clearly and the reserch importance addressed. Then the previous studies in the field of research are listed. Accordingly, the introduction should be written in a more effective way to cover all aspects of the topic. The subject of the introduction focused on aspects and overlooked other important aspects. Some mathematical methods have been presented that can contribute to establishing a certain guess, while the water supply itself is not given any importance in terms of clarifying the factors controlling it in theory. The research problem should be clearly clarified in a separate sentence, and then presenting the originality of the idea and the importance of the research, what is the intended final benefit, and whether the research aims at any knowledge addition at the academic or applied research level. What makes the research subject to criticism is not to address more previous studies and present their most important conclusions. 
It is also preferable to redesign the target and write it in a clear text, away from ambiguity. I would prefer to re-state the aim of study “This study can establish valid SAR among features and recognize association degree of SAR. Finally, the value range of features in SAR can be obtained through the association rules”.
Materials and methods
- The situation of study area was included within the materials and methods section! It is not suitable to be placed in this section; it is better to be as an independent section.
- At last paragraph of the Materials and Methods, the author has stated “The data for this study came from the monthly data of Shenzhen Bureau of Water, the Shenzhen Bureau of Statistics, the Shenzhen Water Group and Digital Water System from 2004 to 2019. The methods in this study were developed in Python3”. This topic is not related directly to the study area but rather to the methods of work and data collection, so I would prefer to transfer it to suitable place.
What colors (green, orange, and blue) mean in Fig. 2.?
I do not understand why the author resorts to drawing equations for some common topics, and if he referred to computer software programs, it would be better, for example “spearman's rank correlation coefficient”.
The scatter diagrams need to be explained how clarifies the relationships between different features.
The expression 3D = 3 in section 2.2. Mining results and discussion: is not accepted to be written by this style, please.
The highest confidence degree in Table 3 is low (0.7)? Can such a result be taken into consideration?
**********
6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review?
For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: Yes: Salih Muhammad Awadh
[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
Submitted filename: Reviewer comments.docx
5 Jul 2021
Journal Requirements:
1. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results.
Answer: the protocol has been created. Because our experiments were done separately and we might apply for a patent based on this method, we only uploaded the data set and a code demonstration. However, the code runs smoothly and generates the results, and the same is true for other categories of association analysis. https://www.protocols.io/workspaces/water-supply-association-analysis
2. Ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.
Answer: we agree to revise. The style of the paper has been revised.
3. The grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.
Answer: we have checked it carefully. Based on the Submission Guidelines, the funding sources in the Acknowledgments have been removed. The National Key Research and Development Program of China and the National Natural Science Foundation of China have grant numbers, but the Innovation Foundation of North China University of Water Resources and Electric Power for PhD Graduates does not have a grant number, because this Innovation Foundation is actually an honor institution established by the university. The fund is provided to outstanding doctors for scientific research and innovation. Although it is named Innovation Foundation, it is not an official organization. It is just an honor organization, so there is no grant number. But the money does go to research, so the university requires that the funding information be included in the paper.
4. We note that Figure 1 in your submission contain [map/satellite] images which may be copyrighted.
Answer: we agree to revise. The main content of this study is data mining, which is not closely related to the location of the study area, so Figure 1 has been removed.
For the comments of reviewer #1
The first comment and the second comment are statements by the reviewer, so they do not need to be revised.
1. 3. For readers to quickly catch your contribution, it would be better to highlight major difficulties and challenges, and your original achievements to overcome them, in a clearer way in abstract and introduction.
Answer: we agree to revise. We rewrote the abstract and introduction. The major difficulties and challenges, and the original achievements that overcome them, are now stated in the abstract and introduction. We have also revised some inappropriate usage and language in the abstract and introduction.
2. 4. A data mining coupling method is adopted for water supply attribution analysis. What are other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results?
Answer: Unfortunately, despite extensive experiments and research, we have not yet developed a feasible alternative. However, we believe that any algorithm that can analyze discrete data, and whose results can be validated by some means, is a feasible alternative. We have taken this problem as our current key research direction, hoping to find another effective data mining method for association rules and to compare the two methods to find their advantages and disadvantages. At the same time, we will try different programming languages to compare their advantages in data mining. The mainstream analysis methods can only recognize sensitive features; they do not show how the features are associated with each other, or whether the association is direct or indirect. Moreover, the features recognized by these methods have not been validated: when the value interval of a feature changes, it is not known whether these sensitive features are still valid. The data mining method based on discrete data proposed in this study solves these problems: it obtains association rules and filters out invalid rules through validation. In addition, it yields the value range of the features in each rule, so the analysis results are reliable. Therefore, the results of these mainstream analysis methods are not reliable, and the method in this study is superior and more reliable. These contents have been presented in the introduction, results and conclusions.
3. 5. The kmeans clustering discretization is adopted to solve the optimal discretization degree to avoid multicollinearity problem. What are the other feasible alternatives? What are the advantages of adopting this soft computing technique over others in this case? How will this affect the results?
Answer: The commonly used discretization methods include equal width discretization, equal frequency discretization and clustering discretization.
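As a rough illustration of the three unsupervised discretization methods just named, the following Python sketch bins one feature with equal width, equal frequency and kmeans discretization, then one-hot encodes the kmeans result. It is only a sketch under stated assumptions: the sample values, the function names and the plain one-dimensional Lloyd iteration with evenly spaced starting centroids are hypothetical and are not taken from the authors' code.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Interior breakpoints splitting [min, max] into k equally wide bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior breakpoints placing roughly len(values) / k points in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]


def kmeans_edges(values, k, max_iter=100):
    """Interior breakpoints from 1-D k-means: midpoints between sorted centroids."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    # Assumption: deterministic start at the centre of each equal-width bin.
    centroids = [lo + step * (i + 0.5) for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    centroids.sort()
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]


def assign_bins(values, edges):
    """Map each value to the index of the interval it falls in."""
    return [bisect_right(edges, v) for v in values]


def one_hot(labels, k):
    """One-hot rows, the item format an apriori implementation can consume."""
    return [[1 if label == j else 0 for j in range(k)] for label in labels]


# Hypothetical feature values with three groups of unequal size and spacing.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0, 21.0, 30.0]
width_labels = assign_bins(sample, equal_width_edges(sample, 3))
freq_labels = assign_bins(sample, equal_frequency_edges(sample, 3))
kmeans_labels = assign_bins(sample, kmeans_edges(sample, 3))
encoded = one_hot(kmeans_labels, 3)
```

On this made-up sample the three methods place their breakpoints differently: equal width separates 20.0 from 21.0, equal frequency splits the dense low group in two, and the kmeans sketch keeps each visible group in its own interval.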
The discretization results and defects of these methods have been described in section 2.1 (Discretization) of the paper. The quality of the discretization intervals directly affects the quality of the analysis results. Poor discretization intervals may cause the algorithm to miss some valuable rules. Through experiments, we found that the discretization results of the clustering method are optimal. Kmeans is recognized as a method with good effect and low computation load among the current cluster analysis methods; it can generate interval breakpoints and is self-adaptive. Other methods cannot map the analysis results to a certain interval. The research method iteratively runs the algorithm for different intervals, and the association rules of different association degrees can be mapped to different sensitive intervals, making the analysis results clearer. The kmeans algorithm is self-adaptive, and we can also manually set its parameters, such as the random state and the maximum number of iterations, which is also one of the reasons for the good analysis effect. These contents have been revised in section 2.1 (Discretization) of the paper.
4. 6. The apriori algorithm is adopted to solve the optimal discretization degree to obtain association rule and association degree. What are the other feasible alternatives? What are the advantages of adopting this algorithm over others in this case? How will this affect the results?
Answer: the other feasible alternative is the fpgrowth algorithm. In the fpgrowth algorithm, events are mapped to paths in the FP tree to construct the tree structure. Although the two algorithms are different, their data mining results are the same, so we do not repeat the analysis. Mainstream association methods, such as clustering, neural networks, similarity and time series, cannot effectively obtain the analysis results of this study.
Compared with the fpgrowth algorithm, the apriori algorithm adopts an iterative, layer-by-layer search; the algorithm is simple, has better performance, and is easy to develop. The advantages of adopting this algorithm have been stated in section 1 (Introduction) and section 2 (Materials and methods).
5. 7. Shenzhen is adopted as the case study. What are other feasible alternatives? What are the advantages of adopting this case study over others in this case? How will this affect the results? The authors should provide more details on this.
Answer: the research method is data-driven, so any city that can provide detailed data can be used as a case study. Our team is participating in the construction of the first stage of the Shenzhen Wisdom Water Project, which includes attribution analysis of water supply fluctuations. Shenzhen is a prototype of socialist modern city construction and a pilot city of wisdom water and intelligent dispatching in China. There are therefore many monitoring and metering devices in Shenzhen, as well as big data systems and a digital water system that allow us to get detailed data. Wisdom water is being gradually promoted in China, and we will apply the research method to cities that can provide detailed data. Meanwhile, we hope to use detailed data provided by other cities for analysis and to compare the results with those of Shenzhen to find the advantages. We also hope that this research method can provide support for the water supply of Shenzhen and other cities in China. Data mining methods can find hidden information that the data cannot tell us directly, and obtain previously unknown and valuable knowledge. But it is impossible to carry out data mining without enough data. This part of the content is not relevant to the research, so it was not revised in the paper.
6. 8. Historical records of 2004 to 2019 are taken. Why are more recent data not included in the study?
Is there any difficulty in obtaining more recent data? Are there any changes to situation in recent years? What are its effects on the result?
Answer: The coming period is a critical period for Shenzhen to build socialist modernization and a crucial period for implementing the construction of the Guangdong-Hong Kong-Macao Greater Bay Area. Shenzhen is also a special economic zone in China, and the water supply of Hong Kong mainly relies on the Shenzhen Reservoir. Therefore, the recent measured data were considered confidential, and our team was unable to obtain them. Shenzhen is the first city in China to be fully urbanized, and it has complete water, land, air, and railway ports. Its urban functions are therefore complete, so there have been no big changes in Shenzhen in recent years. Data mining is a data-driven method that finds valuable information from data, so changes in the external environment will not affect the data mining method. This part of the content is not relevant to the research, so it was not revised in the paper.
7. 9. The discretization results of three methods as shown in Fig. 2 are adopted in this study. What are the other feasible alternatives? What are the advantages of adopting these methods over others in this case? How will this affect the results? More details should be furnished.
Answer: the other feasible alternatives are supervised methods. Data discretization methods include supervised methods and unsupervised methods, and the classification criterion is whether the data contain category information. Supervised discretization takes category information into account while unsupervised discretization does not. If a supervised discretization method is to be used, manual annotation of the data is needed to add category information, which is very complicated, and the manual annotation may yield many different results.
Therefore, unsupervised discretization methods are used more widely, and unsupervised discretization methods are selected in this study. They include equal width discretization, equal frequency discretization and clustering discretization. The quality of the discretization intervals directly affects the quality of the analysis results. Poor discretization methods may cause damage in the data mining process. These contents have been revised in section 2.1 (Discretization).
8. 10. Eq. 10 is adopted as the objective function. What are other feasible alternatives? What are the advantages of adopting this function over others in this case? How will this affect the results? The authors should provide more details on this.
Answer: Since the objective function is set according to the idea of the researcher, we believe that any reasonable function can be used as an alternative. Since association rules obtained by data mining methods need to be validated, the purpose of this study is not only to obtain more association rules, but also to obtain more effective association rules. We may lose quality if we focus only on the quantity of SAR, and we may lose quantity if we focus only on the quality of SAR. Therefore, the objective function set in this study is to maximize the average lift degree of the valid association rules. This objective function makes the analysis results of iterative calculation more reliable, so we can get a more reliable and universal method. These contents have been revised in section 2.3 (Coupling method).
9. 11. Eqs. 11 and 12 are adopted as the minimum support degree thresholds. What are the other feasible alternatives? What are the advantages of adopting these thresholds over others in this case? How will this affect the results? More details should be furnished.
Answer: the number of feasible alternatives is enormous, because this threshold can be set manually.
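To make the quantities in these answers concrete, the following Python sketch computes the support, confidence and lift of rules that point at one target item, filters them with manually set thresholds, and averages the lift of the rules whose lift exceeds 1. It is a sketch under stated assumptions, not the authors' implementation: the transactions, item labels, threshold values and function names are hypothetical, the exact form of Eq. 10 is not reproduced here, and the candidates are enumerated by brute force rather than with apriori's layer-by-layer pruning.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def mine_rules(transactions, target, min_support, min_confidence, max_items=2):
    """Rules `antecedent -> target` with 1..max_items antecedent items."""
    items = sorted({i for t in transactions for i in t if i != target})
    target_support = support(frozenset([target]), transactions)
    rules = []
    for size in range(1, max_items + 1):
        for combo in combinations(items, size):
            antecedent = frozenset(combo)
            antecedent_support = support(antecedent, transactions)
            joint = support(antecedent | {target}, transactions)
            # Support threshold: drop rare or never-occurring combinations.
            if antecedent_support == 0 or joint < min_support:
                continue
            confidence = joint / antecedent_support
            # Confidence threshold: drop weak rules.
            if confidence < min_confidence:
                continue
            lift = confidence / target_support
            rules.append((combo, joint, confidence, lift))
    return rules


def mean_lift_of_valid_rules(rules):
    """Average lift over rules with lift > 1; 0.0 when no rule is valid."""
    valid = [lift for _, _, _, lift in rules if lift > 1]
    return sum(valid) / len(valid) if valid else 0.0


# Hypothetical discretized records; each item is "feature=interval".
transactions = [
    {"DWU=high", "FP=high", "WS=high"},
    {"DWU=high", "FP=high", "WS=high"},
    {"DWU=high", "R=low", "WS=high"},
    {"DWU=low", "R=low", "WS=low"},
    {"DWU=low", "FP=high", "WS=low"},
    {"DWU=low", "R=low", "WS=low"},
]
rules = mine_rules(transactions, "WS=high", min_support=0.3, min_confidence=0.6)
objective = mean_lift_of_valid_rules(rules)
```

Raising `min_support` or `min_confidence` in this sketch shrinks the rule list, which is the trade-off between rule quantity and rule quality that the answers describe.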
The threshold also needs to meet several criteria, such as computation load and efficiency. If the support degree threshold is set too high, the time to calculate frequent item sets is reduced, but some association items hidden in the data are easily filtered out. Because the confidence degree needs to be calculated after the support degree, the support degree threshold should be set as small as possible. If the confidence degree threshold is set too low, a large number of invalid rules may be generated, leading to a high calculation load and greatly increasing the time of data mining. Therefore, this study considered both calculation load and efficiency, combined with previous data mining experience, to finally set these thresholds. In addition, we also set a small confidence degree threshold in the results and discussion to compare the differences in the analysis results. Because rules need to be filtered by the confidence degree threshold no matter how the support degree threshold is set, this study only discusses the difference of the confidence degree threshold. These contents have been revised in section 2.3 (Coupling method), section 4.2 (Experiment results) and section 5 (Discussion).
10. 12. Six features are adopted in the study. What are the other feasible alternatives? What are the advantages of adopting these features over others in this case? How will this affect the results? More details should be furnished.
Answer: the water use of Shenzhen consists of domestic water use, industrial water use, service industry water use, ecological water use, agricultural water use and construction industry water use. The other feasible alternatives include ecological water use, agricultural water use and construction industry water use. Domestic water use, industrial water use and service industry water use were selected by Spearman's rank correlation coefficient.
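The Spearman-based screening mentioned here can be sketched in a few lines of Python. The sketch assumes made-up values for one feature and for water supply, and its function names are hypothetical. It computes Spearman's coefficient as the Pearson coefficient of the ranks, so a monotone but nonlinear relationship scores 1 under Spearman and less than 1 under Pearson.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: supply grows monotonically but not linearly with the feature.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
supply = [v ** 3 for v in feature]
rho = spearman(feature, supply)
r = pearson(feature, supply)
```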
Shenzhen is a fully urbanized city with a floating population of more than 8 million. Shenzhen has abundant rainfall, with an average annual rainfall of 1830 mm. At the same time, a large amount of water use has produced a large amount of wastewater, so the amount of wastewater reuse in Shenzhen is increasing. The six features have great influence on water supply and are more targeted. Other features have little relationship with water supply fluctuations, so they cannot generate valid strong association rules with water supply. At the same time, we can reduce the calculation load of data mining by not choosing them. This has been added in section 3.2 (Data description).
11. 13. Spearman's rank correlation coefficient is adopted to select the features. What are the other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? More details should be furnished.
Answer: the other feasible alternative is the Pearson correlation coefficient. The Pearson correlation coefficient evaluates the linear relationship between two variables. When the variables do not follow a normal distribution or their relationship is more complex than a linear one, the Pearson coefficient is no longer valid. Spearman's rank correlation coefficient evaluates a monotone relationship rather than a linear relationship between two variables. The Spearman coefficient does not require prior knowledge, and its application scope is wider than that of the Pearson coefficient. The analysis results of different correlation coefficients may differ. This study focuses on the attribution analysis of water supply and on searching for valid strong association rules, so it is not suitable to use a linear relationship as the standard. This has been added in section 3.2 (Data description).
12. 14. A scatter density diagram matrix is adopted to guide subsequent analysis.
What are the other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? More details should be furnished.
Answer: the other alternatives include the Pearson correlation coefficient and Spearman's rank correlation coefficient, which can also be used to describe the relationship between variables. But the relationships they can describe are very limited. The scatter density plot matrix can describe a variety of complex relationships, such as the density distribution of a single feature and the density and correlation relationships between features, so this method is powerful. Exploratory data analysis (EDA) lets us intuitively understand the characteristics of the data and the potential relationships between features, and is a valid method of data analysis. Meanwhile, EDA helps discover what the data is trying to tell us and can be used to look for patterns and relationships to guide our subsequent analysis. This has been added in section 4.1 (Exploratory data analysis).
13. 15. “…Compared with D = 4, D = 3 is considered to be better in this study, because the.…” More justification should be furnished on this issue.
Answer: we agree to revise. The content has been added to section 4.2 (Experiment results).
14. 16. SAR with one-item set and two-item set are adopted in the experiments. What are the other feasible alternatives? What are the advantages of adopting this approach over others in this case? How will this affect the results? More details should be furnished.
Answer: the other feasible alternatives are three-item SAR and four-item SAR. They were not adopted because a feature strongly associated with WS dominates the SAR, which is equivalent to giving this feature a large weight. If a new feature that has a strong association degree with WS is added to a two-item SAR, the three-item SAR will undoubtedly have a strong association degree with WS.
Even if a feature that is completely uncorrelated with WS is added to the two-item SAR, the three-item SAR still has a strong association degree with WS because of the feature that carries a large weight. Take the first rule in Table 6 of the paper as an example: its lift degree is very high, so the association degree of the rule is very high. If SIWU is added, there is no doubt that the three-item rule is a valid strong association rule with water supply. If R is added, although the association degree between R and WS is not strong, FP and DWU have a very strong association degree with water supply. So adding R is equivalent to adding some noise to the previous SAR, which will only have a weak impact on the analysis results. However, this result is not consistent with our original intention. Therefore, this study only focuses on the one-item SAR and two-item SAR, which also reduces the calculation load of data mining. This part has been added in section 4.2 (Experiment results).
15. 17. Some key parameters are not mentioned. The rationale on the choice of the particular set of parameters should be explained with more details. Have the authors experimented with other sets of values? What are the sensitivities of these parameters on the results?
Answer: we agree to revise. A more detailed statement has been added to the paper. In this study, we experimented with other sets of values, and the results showed that the rules of the three-item and four-item sets had a certain association degree with WS. But this study only focuses on the one-item SAR and two-item SAR, which also reduces the calculation load of data mining. As stated in the answer to question 16, these results are not consistent with our original intention, so this study did not focus on them.
16. 18. Some assumptions are stated in various sections. Justifications should be provided on these assumptions.
Evaluation on how they will affect the results should be made.
Answer: all the results in the paper are real results obtained from data mining, without assumptions.
17. 19. The discussion section in the present form is relatively weak and should be strengthened with more details and justifications.
Answer: we agree to revise. The results and discussion have been strengthened. In addition, section 5 (Discussion) has been added, which discusses the results of different confidence degree thresholds. The support degree threshold set in this study is already relatively small, and setting it to an even smaller value would only generate more frequent items. The association rules would still need to be filtered by the confidence degree threshold, so the data mining results of a smaller support degree threshold are not presented in this study.
18. 20. Moreover, the manuscript could be substantially improved by relying and citing more on recent literatures about contemporary real-life case studies of modelling and/or optimization techniques in water distribution systems such as the followings. Discussions about result comparison and/or incorporation of those concepts in your works are encouraged.
Answer: we agree to revise. We have studied these papers carefully and decided to cite them.
19. 21. Some inconsistencies and minor errors that needed attention are: Replace “…So Compared with D = 4, D = 3 is considered to be…” with “…So compared with D = 4, D = 3 is considered to be…” in p.8
Answer: we agree to revise. This capital letter has been revised.
20. 22. In the conclusion section, the limitations of this study and suggested improvements of this work should be highlighted.
Answer: we agree to revise. This part has been added in the conclusions.
For the comments of reviewer #2
21. 1. The research abstract seems almost vague as if it is a fragmented part of the research.
The abstract should be reformulated to be easy to understand and expressive of the overall research content.
Answer: we agree to revise. The abstract has been rewritten and revised further.
22. 2. I think the introduction was written in a format that could be described as good, but it still did not live up to the preferences. It is better to divide the introduction to include the scientific background of the main topic and the methods of work used in the research.
Answer: we agree to revise. The structure of the introduction was adjusted and the text rewritten. We clearly stated the importance and expected benefits of this study, analyzed the limitations of previous studies, and clearly stated the objectives of this study.
23. 3. Materials and methods. The situation of study area was included within the materials and methods section! It is not suitable to be placed in this section; it is better to be as an independent section. At last paragraph of the Materials and Methods, the author has stated “…”. This topic is not related directly to the study area but rather to the methods of work and data collection, so I would prefer to transfer it to suitable place.
Answer: we agree to revise. The study area and data description are now placed in section 3 (Study area and data description), which is an independent section. The development tools of the algorithm are described in section 2.3 (Coupling method), together with the statement of the coupling method.
24. 4. What colors (green, orange, and blue) mean in Fig. 2.?
Answer: these three colors represent the value distribution in the interval after discretization. A detailed description has been added to section 2.1 (Discretization) of the paper.
25. 5. I do not understand why the author resorts to drawing equations for some common topics, and if he referred to computer software programs, it would be better, for example “Spearman's rank correlation coefficient”.
Answer: The algorithm and parameters in this study were obtained through previous data mining experience and a large number of experiments. Lengthy explanation and statement of the methods would to some extent weaken the importance of the topic. Equations express the algorithm, the flow of the methods and the parameters more clearly and intuitively. Spearman's rank correlation coefficient is used directly in this study to calculate the monotone relationship of features, but this method was not developed in this study. Moreover, this method is so common that it does not need to be stated or explained.
26. 6. The scatter diagrams need to be explained how clarifies the relationships between different features.
Answer: we agree to revise. A more detailed statement clarifying the relationships between different features has been added to section 4.1 (Exploratory data analysis).
27. 7. The expression 3D = 3 in section 2.2. Mining results and discussion: is not accepted to be written by this style, please.
Answer: we agree to revise. But I do not understand the meaning of the sentence "The expression 3D = 3 in section 2.2.", and I cannot find the expression 3D = 3 in section 2.2. Mining results and discussion: the title has been revised.
28. 8. The highest confidence degree in Table 3 is low (0.7)? Can such a result be taken into consideration?
Answer: the rules in the table are sorted by lift degree, and the highest confidence degree is not 0.7. As long as the confidence degree of a rule is greater than the confidence degree threshold, the rule should be considered. Finally, the rule must be validated by the lift degree: if the lift degree of the rule is greater than 1, the rule is considered valid. The association degree of a rule depends on its lift degree. Therefore, all rules that meet the threshold conditions should be considered.
Self-revision
The variable description (Table 1) has been added to section 3.2 (Data description).
The discretization results of D = 3 and the discretization results of D = 4 are presented in two tables, respectively. The maximum and minimum values of the features are added to the value interval. In the process of repeatedly reading and revising the paper, we found some mistakes and some unclear expressions, and these problems have been revised.
Submitted filename: Response to Reviewers.docx
22 Jul 2021
The water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm
PONE-D-21-16186R1
Dear Dr. Sang,
We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.
Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.
An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.
If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.
Kind regards,
Zaher Mundher Yaseen
Academic Editor
PLOS ONE
Additional Editor Comments (optional):
Reviewers' comments:
Reviewer's Responses to Questions
Comments to the Author
1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed
**********
2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Yes
**********
3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: Yes
**********
4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
**********
5.
Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 6. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: (No Response) Reviewer #2: Through my review of this round of scientific evaluation of the manuscript entitled “The water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm” and after reviewing the author's response to the comments of the reviewers. I found that the author has responded very well and submitted a satisfactory modified version that meets the research basics. The modified manuscript becomes more enhanced than before, and on this basis, Eventually, the author has adequately addressed the reviewer comments raised in a previous round of review and I feel the manuscript is now acceptable for publication. I here authorize the editor to accept it as my decision is to accept the research. ********** 7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. 
Reviewer #1: No Reviewer #2: Yes: Salih Muhammad Awadh Submitted filename: Reviewer Report.docx Click here for additional data file. 27 Jul 2021 PONE-D-21-16186R1 The water supply association analysis method in Shenzhen based on kmeans clustering discretization and apriori algorithm Dear Dr. Sang: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Zaher Mundher Yaseen Academic Editor PLOS ONE

