| Literature DB >> 34141897 |
Elham Azhir1, Nima Jafari Navimipour2, Mehdi Hosseinzadeh3, Arash Sharifi1, Aso Darwesh4.
Abstract
Query optimization is the process of identifying the best Query Execution Plan (QEP). The query optimizer produces a close to optimal QEP for the given queries based on the minimum resource usage. The problem is that for a given query, there are plenty of different equivalent execution plans, each with a corresponding execution cost. To produce an effective query plan thus requires examining a large number of alternative plans. Access plan recommendation is an alternative technique to database query optimization, which reuses the previously-generated QEPs to execute new queries. In this technique, the query optimizer uses clustering methods to identify groups of similar queries. However, clustering such large datasets is challenging for traditional clustering algorithms due to huge processing time. Numerous cloud-based platforms have been introduced that offer low-cost solutions for the processing of distributed queries such as Hadoop, Hive, Pig, etc. This paper has applied and tested a model for clustering variant sizes of large query datasets parallelly using MapReduce. The results demonstrate the effectiveness of the parallel implementation of query workloads clustering to achieve good scalability. ©2021 Azhir et al.Entities:
Keywords: Access plan recommendation; Cluster computing; DBSCAN Algorithm; MapReduce; Parallel Processing; Query optimization
Year: 2021 PMID: 34141897 PMCID: PMC8176525 DOI: 10.7717/peerj-cs.580
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1Data flow in MapReduce.
Figure 2The access plan recommendation approach.
Figure 3The parallel access plan recommendation using MapReduce flow.
The Hadoop cluster setup.
| Nodes | OS | Configurations | No. of cores | Processor base frequency | Configuration |
|---|---|---|---|---|---|
| Master machine (Name node) | Red Hat (64bit)-Linux | 8 GB RAM | 4 | 2.53 GHZ | Intel Core i7 |
| Slave machines (Data nodes) | Red Hat (64bit)-Linux | 3 GB RAM | 2 | 2.53 GHZ | Intel Core i7 |
| Hadoop version | Hadoop-2.8.0 | – | |||
| Virtual machine management | Virtualbox 6.1.16 |
The description of datasets.
| Features | Dataset name | No. of classes | No. of individuals |
|---|---|---|---|
| Selection | S1 | 3 | 95 |
| S2 | 7 | 235 | |
| S3 | 10 | 389 | |
| S4 | 12 | 481 | |
| S5 | 14 | 593 | |
| Selection/join, from, group-by and order-by | SJ1 | 4 | 60 |
| SJ2 | 7 | 199 | |
| SJ3 | 10 | 389 | |
| SJ4 | 12 | 481 | |
| SJ5 | 14 | 593 | |
| Selection/join, from, group-by and order-by alongsidenested sub-queries | SJN1 | 4 | 108 |
| SJN2 | 6 | 140 | |
| SJN3 | 9 | 325 | |
| SJN4 | 11 | 440 | |
| SJN5 | 14 | 593 |
Figure 4The clustering quality validation for the selection queries datasets (S1–S5).
(A) Dunn index; (B) Silhouette coefficient.
Figure 6The clustering quality validation factor for the selection/join/nested sub-queries datasets (SJN1–SJN5).
(A) Dunn index; (B) Silhouette coefficient.
Figure 7MapReduce processing time variation for the selection queries’ datasets (S1–S5).
Figure 8The clustering quality validation using ARI.
Epsilon and Minpts best values for the proposed algorithm.
| Dataset | ARI | No. of Clusters | Minpts | Eps |
|---|---|---|---|---|
| S1 | 0.6997 | 3 | 8 | 0.03 |
| S2 | 0.7504 | 7 | 16 | 0.1 |
| S3 | 0.6956 | 10 | 16 | 0.08 |
| S4 | 0.7348 | 12 | 16 | 0.12 |
| S5 | 0.7011 | 14 | 16 | 0.12 |
| SJ1 | 0.4905 | 1 | 8 | 0.06 |
| SJ2 | 0.7361 | 5 | 8 | 0.06 |
| SJ3 | 0.4859 | 12 | 8 | 0.09 |
| SJ4 | 0.6084 | 12 | 8 | 0.06 |
| SJ5 | 0.5062 | 14 | 8 | 0.09 |
| SJN1 | 0.0310 | 2 | 12 | 0.06 |
| SJN2 | 0.0923 | 2 | 12 | 0.06 |
| SJN3 | 0.0482 | 3 | 12 | 0.02 |
| SJN4 | 0.0310 | 5 | 12 | 0.06 |
| SJN5 | 0.0324 | 7 | 12 | 0.06 |