Jimmy Ming-Tai Wu, Min Wei, Mu-En Wu, Shahab Tayeb.
Abstract
A top-k dominating (TKD) query finds interesting objects by returning the k objects that dominate the most other objects in a given dataset. Incomplete datasets have missing values in some dimensions, so traditional data mining methods designed for complete data struggle to extract useful information from them. The BitMap Index Guided (BIG) algorithm is a good choice for solving this problem. However, finding top-k dominating objects is even harder on incomplete big data: when the dataset is very large, the demands on the feasibility and performance of the algorithm become very high. In this paper, we propose the Efficient Hadoop BitMap Index Guided Algorithm (EHBIG), which applies MapReduce to the whole process together with a pruning strategy. EHBIG answers TKD queries on incomplete datasets through a BitMap index and uses the MapReduce architecture to make TKD queries feasible on large datasets. The pruning strategy greatly reduces runtime and memory usage. We also propose an improved version of EHBIG, denoted IEHBIG, which optimizes the whole algorithm flow. Experimental results show that the proposed algorithms perform well on TKD queries over large incomplete datasets and achieve good performance on a Hadoop computing cluster.
Keywords: Big data framework; Dominance relationship; Incomplete data; MapReduce; Top-k dominating query
Year: 2021 PMID: 34421217 PMCID: PMC8369331 DOI: 10.1007/s11227-021-04005-x
Source DB: PubMed Journal: J Supercomput ISSN: 0920-8542 Impact factor: 2.474
Example of a movie recommendation system
| ID | Movie Name | a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|---|---|
| m1 | Schindler’s List (1993) | – | – | 3 | 4 | 2 |
| m2 | The Godfather (1972) | 5 | 2 | 1 | – | – |
| m3 | The Silence of the Lambs (1991) | – | 3 | 4 | 5 | 3 |
| m4 | Star Wars (1977) | 3 | 1 | 5 | 3 | 4 |
Fig. 1 A sample incomplete dataset
A sample incomplete dataset table
| object | | | | |
|---|---|---|---|---|
| | – | 1 | 2 | – |
| | 1 | – | 3 | 2 |
| | 3 | 1 | – | – |
| | – | – | – | 1 |
| | – | 2 | 1 | – |
| | – | 2 | – | 3 |
| | 1 | 1 | – | – |
| | – | 3 | 2 | – |
| | 2 | – | 2 | 2 |
| | 3 | 3 | – | – |
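
To make the TKD query concrete, the following is a minimal brute-force sketch over the ten rows of the sample table above. It is not the EHBIG algorithm. The dominance rule is an assumption, since this record does not spell it out: one object dominates another if, on every dimension where both have a value, it is never worse and is strictly better at least once, with smaller values treated as better (flip the comparisons if larger is better). The names `DATA`, `dominates` and `top_k_dominating` are illustrative only.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[int]]

# The ten rows of the sample incomplete dataset, in table order.
# None stands for a missing value ("–" in the table).
DATA: List[Row] = [
    (None, 1, 2, None),
    (1, None, 3, 2),
    (3, 1, None, None),
    (None, None, None, 1),
    (None, 2, 1, None),
    (None, 2, None, 3),
    (1, 1, None, None),
    (None, 3, 2, None),
    (2, None, 2, 2),
    (3, 3, None, None),
]


def dominates(a: Row, b: Row) -> bool:
    """Assumed rule: compare only dimensions observed in both rows.

    a dominates b if a is never larger than b on those dimensions and is
    strictly smaller on at least one (smaller = better is an assumption).
    """
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # dimension missing in one of the rows: skip it
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(data: Sequence[Row], k: int) -> List[Tuple[int, int]]:
    """Return (row index, score) for the k rows dominating the most rows.

    Ties are broken by the smaller row index. This is O(n^2) pairwise
    comparison, which is the cost that index- and pruning-based methods
    aim to avoid.
    """
    scored = [
        (sum(dominates(a, b) for b in data), i) for i, a in enumerate(data)
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]
```

Calling `top_k_dominating(DATA, 2)` returns the zero-based row indices of the two highest-scoring rows together with their scores.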
A sample BitMap indexing table
| Items | | – | 1 | 2 | 3 | | – | 1 | 2 | 3 | | – | 1 | 2 | 3 | | – | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | – | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | – | 1 | 1 | 1 | 1 |
| | 1 | 1 | 0 | 0 | 0 | – | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 0 |
| | 3 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | – | 1 | 1 | 1 | 1 | – | 1 | 1 | 1 | 1 |
| | – | 1 | 1 | 1 | 1 | – | 1 | 1 | 1 | 1 | – | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| | – | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | – | 1 | 1 | 1 | 1 |
| | – | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | – | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 0 |
| | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | – | 1 | 1 | 1 | 1 | – | 1 | 1 | 1 | 1 |
| | – | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 2 | 1 | 1 | 0 | 0 | – | 1 | 1 | 1 | 1 |
| | 2 | 1 | 1 | 0 | 0 | – | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 0 |
| | 3 | 1 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | – | 1 | 1 | 1 | 1 | – | 1 | 1 | 1 | 1 |
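
The bit patterns in the BitMap table above follow a regular rule that can be read off its rows: each dimension gets four bits, one per column (–, 1, 2, 3); a missing value is encoded as all ones; an observed value v sets the "–" bit and every column whose label is smaller than v. The sketch below reproduces that encoding. The rule is inferred from the sample rows rather than stated in this record, and `DOMAIN`, `encode_value` and `encode_row` are hypothetical names.

```python
from typing import List, Optional, Sequence

# Values that occur in each dimension of the sample dataset (assumed domain).
DOMAIN = (1, 2, 3)


def encode_value(v: Optional[int], domain: Sequence[int] = DOMAIN) -> List[int]:
    """Encode one cell as bits for the columns (–, 1, 2, 3).

    Missing value -> every bit is 1.
    Observed v    -> the "–" bit is 1, and column c is 1 only if c < v.
    """
    if v is None:
        return [1] * (len(domain) + 1)
    return [1] + [1 if c < v else 0 for c in domain]


def encode_row(row: Sequence[Optional[int]]) -> List[int]:
    """Concatenate the per-dimension bit groups of one object."""
    bits: List[int] = []
    for v in row:
        bits.extend(encode_value(v))
    return bits
```

For example, `encode_row((None, 1, 2, None))` yields the sixteen bits of the first row of the table: `1111`, `1000`, `1100`, `1111`.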
Fig. 2 The containment relationships
Fig. 3 The algorithm framework of MRBIG
Fig. 4 The algorithm framework of EHBIG
Fig. 5 The algorithm flow of EHBIG
Fig. 6 The algorithm flow of IEHBIG
A sample original dataset
| User | Movie | Rating |
|---|---|---|
| 1 | 01 | 3 |
| 1 | 02 | 4 |
| 1 | 04 | 2 |
| 2 | 02 | 2 |
| 2 | 03 | 3 |
| 3 | 04 | 3 |
| 4 | 01 | 1 |
| 4 | 04 | 2 |
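
The (User, Movie, Rating) triples above can be turned into an incomplete table of the kind a TKD query operates on. The sketch below pivots them so that each movie becomes one object and each user one dimension, with `None` wherever a user has not rated a movie. That orientation mirrors the movie recommendation example earlier in this record but is an assumption about the preprocessing, and `RATINGS` and `pivot` are illustrative names.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# The sample original dataset as (user, movie, rating) triples.
RATINGS: List[Tuple[int, str, int]] = [
    (1, "01", 3), (1, "02", 4), (1, "04", 2),
    (2, "02", 2), (2, "03", 3),
    (3, "04", 3),
    (4, "01", 1), (4, "04", 2),
]


def pivot(
    ratings: Sequence[Tuple[int, str, int]],
) -> Tuple[List[int], Dict[str, List[Optional[int]]]]:
    """Build a movie -> ratings-per-user table; unrated cells stay None."""
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    column = {u: i for i, u in enumerate(users)}
    table: Dict[str, List[Optional[int]]] = {
        m: [None] * len(users) for m in movies
    }
    for u, m, r in ratings:
        table[m][column[u]] = r
    return users, table
```

With the sample triples, `pivot(RATINGS)` gives four movies over four user dimensions; movie `03`, for instance, has a value only in the column for user 2.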
Fig. 7 The runtime of algorithms processing datasets with different sizes
Fig. 8 The runtime of algorithms testing different parameter combinations
Fig. 9 The runtime of algorithms using different missing rates
Fig. 10 The memory needed in algorithms
Fig. 11 The runtime of algorithms with different k