Wei Fang, V S Sheng, XueZhi Wen, Wubin Pan.
Abstract
In atmospheric science, the volume of meteorological data is massive and growing rapidly. K-means is a fast, widely available clustering algorithm that has been used in many fields. However, for large-scale meteorological data, the traditional K-means algorithm cannot meet the needs of practical applications efficiently. This paper proposes an improved K-means algorithm (MK-means) based on MapReduce, designed around the characteristics of large meteorological datasets. The experimental results show that MK-means offers greater computing capacity and scalability.
Year: 2014 PMID: 24790576 PMCID: PMC3953661 DOI: 10.1155/2014/646497
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1. Google's MapReduce programming model.
Algorithm 1
Figure 2. An example of the execution process of MapReduce, including the intermediate results of each step.
Figure 3. The clustering process of K-means.
Figure 4. The workflow of the Parallel K-means with MapReduce.
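The workflow named in Figure 4 can be illustrated with a minimal single-process sketch of one K-means iteration written in map/reduce style. This is not the paper's MK-means algorithm: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`), the in-memory dictionary that stands in for the shuffle phase, and the toy 2-D points are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step (hypothetical): emit (nearest centroid index, (partial sum, count))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (hypothetical): combine partial sums and counts into a new centroid."""
    total = None
    count = 0
    for partial_sum, n in values:
        if total is None:
            total = list(partial_sum)
        else:
            total = [a + b for a, b in zip(total, partial_sum)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """Run one iteration: map every point, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the MapReduce shuffle phase
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    new_centroids = list(centroids)  # a centroid with no points keeps its position
    for key, values in groups.items():
        new_centroids[key] = kmeans_reduce(values)
    return new_centroids


# Toy data, not taken from the meteorological datasets.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(points, centroids)
print(new_centroids)
```

Because the map output carries a partial sum and a count rather than a bare point, the same reduce function could also serve as a combiner on each mapper, which is a common way to cut shuffle traffic in this kind of job.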
Algorithm 2
Algorithm 3
Algorithm 4
Algorithm 5
Experimental datasets.
| Dataset | File name | Capacity | Matrix | Type |
|---|---|---|---|---|
| 1 | dataset1.txt | 250 MB | 2.5 × 10⁶ × 26 | 1-year dataset |
| 2 | dataset2.txt | 500 MB | 5 × 10⁶ × 26 | 2-year dataset |
| 3 | dataset3.txt | 1 GB | 1 × 10⁷ × 26 | 3-year dataset |
| 4 | dataset4.txt | 2 GB | 2 × 10⁷ × 26 | 4-year dataset |
Figure 5. An example of the status of our Hadoop cluster.
Key Hadoop configuration parameters.
| Configuration parameter name | Parameter value | Description |
|---|---|---|
| io.sort.mb | 256 | Maximum memory for buffering temporary data during the sort phase; data beyond this limit spills to disk (unit: MB) |
| dfs.replication | 3 | Number of replicas stored for each file |
| dfs.block.size | 409600 | Maximum block size; a larger file is split and stored as multiple blocks (unit: bytes) |
| mapred.local.dir | /mapred/local | Local path where data is stored while MapReduce tasks execute |
| mapred.tasktracker. | 2 | Maximum number of Map tasks that can run simultaneously on one TaskTracker |
| mapred.tasktracker. | 1 | Maximum number of Reduce tasks that can run simultaneously on one TaskTracker |
| mapred.reduce. | 30 | Number of parallel copies a Reduce task starts when fetching a large volume of map output |
| io.sort.factor | 100 | Number of streams merged at once while sorting files |
| fs.default.name | hdfs://aiken:9000 | URI (host and port) of the default file system, that is, the HDFS NameNode |
| hadoop.tmp.dir | /root/data1 | Default Hadoop temporary directory |
The results of MK-means.
| Attribute (unit) | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| Precipitation, 20:00 to 20:00 (0.1 mm) | 23 | 19 | 18 | 19 | 455 |
| Average site pressure (0.1 hPa) | 8364 | 9183 | 9966 | 6608 | 9876 |
| Average wind speed (0.1 m/s) | 23 | 20 | 20 | 22 | 20 |
| Average temperature (0.1°C) | 124 | 129 | 157 | 55 | 229 |
| Average vapor pressure (0.1 hPa) | 97 | 112 | 156 | 56 | 252 |
| Average relative humidity (1%) | 58 | 61 | 70 | 54 | 89 |
| Sunshine hours (0.1 hour) | 71 | 67 | 55 | 71 | 14 |
| Minimum temperature (0.1°C) | 72 | 76 | 118 | −3 | 208 |
| Maximum temperature (0.1°C) | 188 | 193 | 206 | 129 | 265 |
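The values in this table are stored as integers in the units given in parentheses (mostly tenths of a unit). As a reading aid, the following sketch converts the Cluster 5 column to conventional units. The dictionary layout and the name `to_conventional_units` are assumptions for illustration; the numbers are copied from the table.

```python
# Cluster 5 column of the table: name -> (stored value, divisor implied by the unit).
# A divisor of 10 means the table stores tenths (e.g. 0.1 mm); 1 means no scaling.
cluster5_raw = {
    "Precipitation (mm)": (455, 10),
    "Average site pressure (hPa)": (9876, 10),
    "Average wind speed (m/s)": (20, 10),
    "Average temperature (°C)": (229, 10),
    "Average vapor pressure (hPa)": (252, 10),
    "Average relative humidity (%)": (89, 1),
    "Sunshine hours (h)": (14, 10),
    "Minimum temperature (°C)": (208, 10),
    "Maximum temperature (°C)": (265, 10),
}


def to_conventional_units(raw):
    """Divide each stored integer by its divisor (hypothetical helper)."""
    return {name: value / divisor for name, (value, divisor) in raw.items()}


cluster5 = to_conventional_units(cluster5_raw)
for name, value in cluster5.items():
    print(f"{name}: {value}")
```

Read this way, Cluster 5 corresponds to about 45.5 mm of precipitation, 22.9 °C average temperature, 89% relative humidity, and 1.4 sunshine hours.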
Figure 6. Test results of different datasets.
Figure 7. System evaluation results.
Figure 8. The square error of clustering results.
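The caption does not define the square error. A minimal sketch, assuming it is the usual K-means objective (the sum of squared Euclidean distances from each point to its nearest centroid), is shown here; the function name `squared_error` and the toy data are hypothetical and not taken from the paper.

```python
def squared_error(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared distance to the closest centroid for this point.
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centroids
        )
    return total


# Toy data, not taken from the meteorological datasets.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
centroids = [(1.0, 2.0), (10.0, 9.0)]
sse = squared_error(points, centroids)
print(sse)
```

A lower value indicates tighter clusters, so tracking this quantity across iterations is a common way to check convergence.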
Throughput of the two different types of datasets.
| File size | Block size | Test set | Write (MB/s) | Read (MB/s) |
|---|---|---|---|---|
| 230 MB | 64 MB | 5 | 2.34 | 0.44 |
| | | 6 | 5.17 | 10.34 |