Yufei Gao1, Yanjie Zhou2, Bing Zhou3, Lei Shi4, Jiacai Zhang1,5.
Abstract
The healthcare industry has generated large amounts of data, and analyzing these data has emerged as an important problem in recent years. The MapReduce programming model has been used successfully for big data analytics. However, data skew invariably occurs in big data analytics and seriously degrades efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). Whereas the traditional MapReduce model uses a one-stage partitioning strategy, PTSH uses a two-stage strategy together with a partition tuning method: it disperses key-value pairs into virtual partitions and recombines the partitions when data skew occurs. The robustness and efficiency of the proposed algorithm were tested on a wide variety of simulated datasets and real healthcare datasets. The results showed that the PTSH algorithm handles data skew in MapReduce efficiently and improves the performance of MapReduce jobs compared with native Hadoop, Closer, and locality-aware and fairness-aware key partitioning (LEEN). We also found that adopting the PTSH algorithm significantly reduces the time needed for rule extraction, since it is more suitable for association rule mining (ARM) on healthcare data.
Year: 2017 PMID: 29065568 PMCID: PMC5415866 DOI: 10.1155/2017/1425102
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Figure 1: MapReduce programming model.
Figure 2: Acquisition of the metadata for reduce tasks.
Algorithm 1: Repartitioning algorithm.
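The body of Algorithm 1 is not reproduced in this record, so the following is only a minimal sketch of the two-stage idea described in the abstract, not the authors' algorithm. It assumes a CRC32 hash for the first stage and a greedy largest-first assignment for the second stage; the function names `virtual_partition`, `combine_partitions`, and `reducer_loads` are hypothetical.

```python
import heapq
import zlib


def virtual_partition(key: str, num_virtual: int) -> int:
    """Stage 1 (assumed): hash a key into one of many fine-grained virtual partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_virtual


def combine_partitions(virtual_sizes: dict[int, int], num_reducers: int) -> dict[int, int]:
    """Stage 2 (assumed heuristic): assign virtual partitions to reducers.

    Partitions are visited from largest to smallest and each one goes to the
    reducer with the smallest load so far. Returns {virtual_partition: reducer}.
    """
    # Min-heap of (current_load, reducer_id) so the least-loaded reducer pops first.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    for vp, size in sorted(virtual_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (load + size, reducer))
    return assignment


def reducer_loads(virtual_sizes: dict[int, int], assignment: dict[int, int], num_reducers: int) -> list[int]:
    """Sum the sizes of the virtual partitions assigned to each reducer."""
    loads = [0] * num_reducers
    for vp, size in virtual_sizes.items():
        loads[assignment[vp]] += size
    return loads


# Hypothetical sizes (e.g. in MB) of five virtual partitions.
sizes = {0: 50, 1: 30, 2: 20, 3: 10, 4: 10}
plan = combine_partitions(sizes, num_reducers=2)
print(reducer_loads(sizes, plan, num_reducers=2))
```

In this sketch a key always maps to the same virtual partition, so all values for a key still reach one reducer; the balancing comes only from how the virtual partitions are recombined.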
Application characteristics.
| Application | Data type | Input data size (GB) | Frequency of variation | Average variation in | Method |
|---|---|---|---|---|---|
| II-1 | Wikipedia | 4 | 61% | 33% | Hadoop, PTSH |
| II-2 | Wikipedia | 4 | 156% | 108% | Hadoop, PTSH |
| WC-1 | RandomWriter | 7.5 | 42% | 136% | Hadoop, PTSH |
| WC-2 | RandomWriter | 7.5 | 125% | 211% | Hadoop, PTSH |
| WC-3 | RandomWriter | 7.5 | 116% | 130% | Hadoop, Closer, LEEN, PTSH |
Figure 3: Performance of II and WC with different variations in the frequency of keys as well as different key distributions.
The detailed performance of the nodes with maximum and minimum load.
| Application | Method | Maximum-load node: size (MB) | Maximum-load node: runtime (seconds) | Minimum-load node: size (MB) | Minimum-load node: runtime (seconds) |
|---|---|---|---|---|---|
| II-1 | Hadoop | 292 | 75 | 27 | 8 |
| II-1 | PTSH | 187 | 49 | 122 | 33 |
| II-2 | Hadoop | 329 | 92 | 15 | 5 |
| II-2 | PTSH | 205 | 66 | 107 | 36 |
| WC-1 | Hadoop | 391 | 154 | 43 | 21 |
| WC-1 | PTSH | 278 | 106 | 230 | 82 |
| WC-2 | Hadoop | 425 | 180 | 12 | 3 |
| WC-2 | PTSH | 291 | 119 | 195 | 71 |
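One way to read the table above is through the ratio of the largest to the smallest node load. The ratio is not a metric reported in the source; it is used here only as an illustrative summary, and the helper name `imbalance_ratio` is hypothetical. The sizes are copied from the table.

```python
# (application, method) -> (max-load size in MB, min-load size in MB), from the table above.
loads = {
    ("II-1", "Hadoop"): (292, 27),
    ("II-1", "PTSH"): (187, 122),
    ("II-2", "Hadoop"): (329, 15),
    ("II-2", "PTSH"): (205, 107),
    ("WC-1", "Hadoop"): (391, 43),
    ("WC-1", "PTSH"): (278, 230),
    ("WC-2", "Hadoop"): (425, 12),
    ("WC-2", "PTSH"): (291, 195),
}


def imbalance_ratio(max_mb: float, min_mb: float) -> float:
    """Ratio of the most-loaded to the least-loaded node; 1.0 means perfectly balanced."""
    return max_mb / min_mb


ratios = {key: imbalance_ratio(*sizes) for key, sizes in loads.items()}
for (app, method), ratio in ratios.items():
    print(f"{app:5} {method:7} max/min = {ratio:.2f}")
```

For II-1, for example, the ratio falls from roughly 10.8 under Hadoop to roughly 1.5 under PTSH.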
Figure 4: Performance of WC-3 with native Hadoop, Closer, LEEN, and PTSH.
Data locality and Cov of each method.
| Method | Cov | Locality range |
|---|---|---|
| Hadoop | 79% | 3% |
| Closer | 23% | 1–12% |
| LEEN | 15% | 1–16% |
| PTSH | 11% | 1–14% |
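This record does not define Cov. Assuming it denotes the coefficient of variation of the per-reducer loads (standard deviation divided by mean), it could be computed as in the sketch below. The load values are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def coefficient_of_variation(loads: list[float]) -> float:
    """Population standard deviation of the loads divided by their mean.

    Assumption: this is what "Cov" stands for; lower values mean more even loads.
    """
    avg = mean(loads)
    if avg == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(loads) / avg


# Hypothetical reducer loads in MB.
balanced = [100, 100, 100, 100]
skewed = [50, 150]
print(f"balanced: {coefficient_of_variation(balanced):.0%}")
print(f"skewed:   {coefficient_of_variation(skewed):.0%}")
```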
Dataset characteristics.
| Attribute characteristic | Transaction | Numbers | Numbers |
|---|---|---|---|
| Categorical, integer | 3.31 | 4,905,142 | 247 |
Association rule parameters.
| Parameter name | Parameter value |
|---|---|
| Minimum confidence | 0.6 |
| Minimum support | 0.2 |
| Minimum interest | 0.3 |
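The support and confidence thresholds above can be illustrated with the standard definitions from association rule mining. This is a minimal sketch on hypothetical toy transactions, not the IM-Apriori implementation; the interest measure is omitted because its definition does not appear in this record. The function names are hypothetical.

```python
def count(transactions: list[set[str]], itemset: set[str]) -> int:
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(transactions: list[set[str]], itemset: set[str]) -> float:
    """Fraction of transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions: list[set[str]], antecedent: set[str], consequent: set[str]) -> float:
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0
    return count(transactions, antecedent | consequent) / base


def passes(transactions, antecedent, consequent, min_support=0.2, min_confidence=0.6) -> bool:
    """Apply the minimum support and minimum confidence thresholds from the table above."""
    rule_support = support(transactions, antecedent | consequent)
    rule_confidence = confidence(transactions, antecedent, consequent)
    return rule_support >= min_support and rule_confidence >= min_confidence


# Hypothetical toy transactions with generic item labels.
toy = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"}]
print(support(toy, {"A", "B"}), confidence(toy, {"A"}, {"B"}), passes(toy, {"A"}, {"B"}))
```

Here the rule A → B appears in 3 of 5 transactions and holds in 3 of the 4 transactions containing A, so it clears both thresholds.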
Performance of IM-Apriori algorithm.
| | Transaction | Number | Runtime | Runtime | Runtime |
|---|---|---|---|---|---|
| ARM-1 | 0.72 | 10 | 914 | 795 | 706 |
| ARM-2 | 1.68 | 20 | 1522 | 1147 | 1058 |
| ARM-3 | 3.31 | 40 | 2359 | 1962 | 1632 |
Figure 5: Performance of data locality, Cov, and the data size of nodes with maximum and minimum load for the three tests.
Figure 6: Detailed performance of each stage and tasks in Hadoop, Closer, and PTSH for ARM-3.
Association rules for NSDUH (2004–2014).
| No. | Rules | Confidence |
|---|---|---|
| 1 | Age = young adult and | 0.72 |
| 2 | Gender = male and marital | 0.65 |
| 3 | Job status = unemployment and | 0.63 |