Qi Liu 1,2, Weidong Cai 3, Dandan Jin 4, Jian Shen 5, Zhangjie Fu 6, Xiaodong Liu 7, Nigel Linge 8.
Abstract
Distributed computing has achieved tremendous development since cloud computing was proposed in 2006, and has played a vital role in promoting the rapid growth of data collection and analysis models, e.g., the Internet of Things, Cyber-Physical Systems, Big Data Analytics, etc. Hadoop has become a data convergence platform for sensor networks. As one of its core components, MapReduce facilitates the allocation, processing and mining of collected large-scale data, where speculative execution strategies help solve straggler problems. However, there is still no efficient solution for accurately estimating the execution time of run-time tasks, which can affect task allocation and distribution in MapReduce. In this paper, task execution data have been collected and employed for the estimation. A two-phase regression (TPR) method is proposed to predict the finishing time of each task accurately. Detailed data of each task have drawn interest, and a detailed analysis report has been produced. According to the results, the prediction accuracy of concurrent tasks' execution time can be improved, in particular for some regular jobs.
Keywords: MapReduce; cloud computing; data analysis; data convergence; speculative execution
Year: 2016 PMID: 27589753 PMCID: PMC5038664 DOI: 10.3390/s16091386
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Implementation detail.
Figure 2. Data collected while running WordCount on the same node. Different groups of data generated by four map tasks running on the same node when executing WordCount: (a) Group 1; (b) Group 2; (c) Group 3; and (d) Group 4.
Figure 3. Data collected while running Sort on the same node. Different groups of data generated by four map tasks running on the same node when executing Sort: (a) Group 1; (b) Group 2; (c) Group 3; and (d) Group 4.
Detailed configuration of each virtual machine.
| Node ID | Memory (GB) | Processor Cores |
|---|---|---|
| Node 1 | 10 | 8 |
| Node 2 | 8 | 4 |
| Node 3 | 8 | 1 |
| Node 4 | 8 | 8 |
| Node 5 | 4 | 8 |
| Node 6 | 4 | 4 |
| Node 7 | 18 | 4 |
| Node 8 | 12 | 8 |
Volume of input data and amount of data collected.
| Application | Input Data (GB) | Data Volume Collected (Groups) |
|---|---|---|
| Sort | 30 | 326 |
| WordCount | 50 | 727 |
Data of map tasks collected by the modified Hadoop.
| File Name | Size |
|---|---|
| attempt_1460032292591_0001_m_000000_0_MAP | 588 B |
| attempt_1460032292591_0002_m_000000_0_MAP | 903 B |
| attempt_1460032292591_0003_m_000000_0_MAP | 1.46 KB |
| attempt_1460032292591_0004_m_000000_0_MAP | 840 B |
| attempt_1460032292591_0005_m_000000_0_MAP | 1.23 KB |
| attempt_1460032292591_0006_m_000000_0_MAP | 1.33 KB |
| attempt_1460032292591_0007_m_000000_0_MAP | 861 B |
| attempt_1460032292591_0008_m_000000_0_MAP | 630 B |
| … | … |
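Each of these per-task files is a trace of the task's reported progress over time. As an illustration only, since the exact on-disk layout of the modified Hadoop's MAP files is not specified here, the sketch below assumes a simple whitespace-separated `progress timestamp` line format, matching the data-structure table shown later.

```python
# Hedged sketch: parse a collected per-task log into (progress, timestamp) pairs.
# The "progress<whitespace>timestamp" layout is an assumption for illustration,
# not the documented format of the modified Hadoop's output files.

def parse_task_log(text):
    """Return a list of (progress, timestamp_ms) tuples from a task log."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue                      # skip blank or malformed lines
        try:
            progress, timestamp = float(parts[0]), int(parts[1])
        except ValueError:
            continue                      # skip non-numeric header lines
        samples.append((progress, timestamp))
    return samples
```

A parser like this turns each file in the table above into the (progress, timestamp) series used for regression.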
Data Structure.
| Progress | Timestamp |
|---|---|
| 0.068 | 1460095563715 |
| 0.112 | 1460095567625 |
| 0.202 | 1460095587516 |
| 0.231 | 1460095607664 |
| 0.240 | 1460095611976 |
| 0.249 | 1460095633288 |
| 0.265 | 1460095633289 |
| 0.292 | 1460095633295 |
| 0.305 | 1460094649633 |
| … | … |
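Given such (progress, timestamp) pairs, a task's finishing time can be estimated by ordinary least-squares regression of progress against time, extrapolating to progress = 1.0. This is a minimal sketch of the direct (single-phase) regression idea, not the paper's exact algorithm; the sample values below are illustrative.

```python
# Sketch: estimate a task's finishing time from (progress, timestamp) samples
# via ordinary least squares, assuming progress grows roughly linearly in time.

def estimate_finish_time(samples):
    """samples: list of (progress, timestamp_ms).
    Returns the predicted timestamp_ms at which progress reaches 1.0."""
    n = len(samples)
    xs = [t for _, t in samples]   # time is the independent variable
    ys = [p for p, _ in samples]   # progress in [0, 1]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var              # progress gained per millisecond
    intercept = mean_y - slope * mean_x
    return (1.0 - intercept) / slope
```

For perfectly linear progress (0.1 at t = 1000 ms, 0.2 at 2000 ms, 0.3 at 3000 ms) the extrapolated finish is t = 10,000 ms; real traces deviate from this line, which is what motivates the two-phase refinement.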
Figure 4. Linear regression calculated by the direct regression algorithm during a task's lifetime. (a) Group 1; (b) Group 2; (c) Group 3; and (d) Group 4 data and their values calculated by linear regression for the WordCount sample.
Figure 5. Linear regression calculated by the direct regression algorithm during a task's lifetime. (a) Group 1; (b) Group 2; (c) Group 3; and (d) Group 4 data and their values calculated by linear regression for the Sort sample.
Error evaluation during the training phase for the WordCount sample.
| Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| RMSE (s) | 1.2 | 1.6 | 2.8 | 1.5 | 1.6 | 1.4 | 1.3 | 1.4 | |
| MAPE (%) | 1.7 | 2.8 | 2.8 | 1.7 | 2.4 | 2.7 | 1.6 | 3.1 | |
| RMSE (s) | 2.4 | 1.4 | 1.0 | 1.9 | 1.7 | 1.7 | 1.3 | 1.2 | |
| MAPE (%) | 3.3 | 2.4 | 1.1 | 5.3 | 5.5 | 2.4 | 1.7 | 2.0 | |
| RMSE (s) | 0.9 | 1.9 | 1.2 | 1.5 | 2.1 | 1.4 | 1.1 | 1.3 | 1.7 |
| MAPE (%) | 1.7 | 3.8 | 1.3 | 2.2 | 4.9 | 3.0 | 2.0 | 2.4 | 2.8 |
Error evaluation during the training phase for the Sort sample.
| Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| RMSE (s) | 1.6 | 1.5 | 1.4 | 1.1 | 1.6 | 2.4 | 1.7 | 1.5 | |
| MAPE (%) | 1.4 | 1.2 | 1.0 | 0.8 | 1.3 | 7.3 | 1.8 | 1.4 | |
| RMSE (s) | 1.6 | 1.6 | 3.1 | 1.6 | 2.4 | 2.4 | 0.9 | 2.1 | |
| MAPE (%) | 1.5 | 2 | 8 | 1.7 | 3.9 | 7.9 | 1.8 | 2.7 | |
| RMSE (s) | 1.7 | 0.6 | 2.3 | 1.6 | 1.3 | 1.8 | 1.4 | 1.6 | 1.5 |
| MAPE (%) | 1.8 | 6.8 | 4.6 | 1.5 | 0.9 | 2.8 | 1.5 | 2.1 | 2.7 |
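The RMSE and MAPE figures reported in these tables can be computed from their standard definitions; the sketch below uses illustrative names (`actual`, `predicted`) and reports RMSE in seconds and MAPE in percent, as in the tables.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n
```

Low RMSE values in the training phase indicate how closely the fitted regression tracks the recorded progress, while MAPE normalizes the error against the actual values.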
Figure 6. Accuracy with different values of .
Error evaluation in the predicting phase for the WordCount sample.
| Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE (s) | 1.8 | 3.5 | 5.7 | 2.7 | 3.6 | 2.8 | 3.5 | 1.9 | | |
| MAPE (%) | 1.7 | 5.6 | 4.9 | 3.3 | 5.6 | 3.5 | 5.1 | 2.4 | | |
| RMSE (s) | 10.9 | 6.7 | 3.7 | 5.2 | 2.7 | 4.4 | 3.1 | 1.9 | | |
| MAPE (%) | 21 | 17.2 | 7.4 | 10.2 | 4.7 | 6.2 | 4.3 | 2.2 | | |
| RMSE (s) | 3.1 | 7.4 | 3.2 | 10.0 | 11.4 | 7.4 | 9.4 | 1.8 | 4.9 | 3.6 |
| MAPE (%) | 6.0 | 18.5 | 4.0 | 18.9 | 33.5 | 20.8 | 26.5 | 1.5 | 9.8 | 5.5 |
Error evaluation in the predicting phase for the Sort sample.
| Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE (s) | 3.4 | 4.4 | 4.2 | 4.1 | 2.8 | 9.4 | 3.4 | 4.1 | | |
| MAPE (%) | 2.7 | 3.7 | 4.5 | 5.4 | 2 | 78.0 | 3.6 | 4.9 | | |
| RMSE (s) | 4.3 | 3.7 | 9.3 | 7.8 | 10.9 | 3.7 | 13.6 | 5.0 | | |
| MAPE (%) | 3.4 | 3.3 | 40.3 | 12 | 28.0 | 7.7 | 12.3 | 10.3 | | |
| RMSE (s) | 3.8 | 7.6 | 10.4 | 3.7 | 9.4 | 6.7 | 10.8 | 3.7 | 6.3 | 4.8 |
| MAPE (%) | 3.5 | 17.3 | 25.4 | 2.8 | 11 | 11.2 | 25.8 | 4.2 | 13.5 | 6.1 |
Figure 7. Two-phase regression values. Error obtained as Abs(Regression Value − Actual Value). (a) Group 1; (b) Group 2; (c) Group 3; and (d) Group 4 data and their values calculated by two-phase regression for the WordCount sample.
Figure 8. Two-phase regression values. Error obtained as Abs(Regression Value − Actual Value). (a) Group 1; (b) Group 2; (c) Group 3; and (d) Group 4 data and their values calculated by two-phase regression for the Sort sample.
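The two-phase idea can be sketched as a piecewise-linear fit: try each candidate breakpoint in the progress trace, fit a separate line to each phase, and keep the split with the smallest total squared error. This is a generic illustration of two-phase regression under that assumption, not a reproduction of the paper's exact TPR procedure.

```python
# Generic two-phase (piecewise linear) regression sketch. The breakpoint
# search below is exhaustive; the paper's TPR method may choose the phase
# boundary differently.

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return slope, my - slope * mx

def sse(xs, ys, slope, intercept):
    """Sum of squared residuals against the fitted line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def two_phase_fit(xs, ys):
    """Try every breakpoint; return (k, line1, line2) minimizing total SSE,
    where line1 fits samples [0, k) and line2 fits samples [k, n)."""
    best = None
    for k in range(2, len(xs) - 1):        # each phase needs >= 2 points
        l1 = fit_line(xs[:k], ys[:k])
        l2 = fit_line(xs[k:], ys[k:])
        err = sse(xs[:k], ys[:k], *l1) + sse(xs[k:], ys[k:], *l2)
        if best is None or err < best[0]:
            best = (err, k, l1, l2)
    return best[1], best[2], best[3]
```

Once the two phases are identified, the second-phase line is the natural one to extrapolate for the remaining execution time, since it reflects the task's current progress rate.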