Muhammad Idris, Shujaat Hussain, Muhammad Hameed Siddiqi, Waseem Hassan, Hafiz Syed Muhammad Bilal, Sungyoung Lee.
Abstract
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity, as real-time and streaming data, in a variety of formats. These characteristics give rise to challenges in their modeling, computation, and processing. Hadoop MapReduce (MR) is a well-known data-intensive distributed processing framework that uses the distributed file system (DFS) for Big Data. Current implementations of MR only support execution of a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. MRPack uses the available computing resources by dynamically managing task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.
Year: 2015 PMID: 26305223 PMCID: PMC4549337 DOI: 10.1371/journal.pone.0136259
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. MRPack Architecture: The Map and Reduce functions are extended to In-Map and In-Reduce, which run for each algorithm under the umbrella of a single job.
Fig 2. Composite Key Structure: Key modeling in MRPack, where the composite key is used to differentiate the algorithms.
Fig 3. The basic data flow of MRPack based on two algorithms, WordCount and InvertedIndex.
Fig 4. Cluster-based analysis of MRPack performance compared to that of generic MapReduce.
Fig 5. Analysis based on changing data size with respect to overall job execution time.
Fig 6. Analysis based on the number of algorithms in an MRPack job with respect to execution time in terms of I/O and network communication.
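
The idea described in the abstract and in the captions for Fig 2 and Fig 3 (several related algorithms sharing one job, with a composite key telling their intermediate records apart) can be illustrated with a small in-memory sketch. This is not the authors' implementation and does not use Hadoop. The tag values, the function names `in_map`, `shuffle`, `in_reduce`, and `run_packed_job`, and the `(algorithm tag, key)` layout are all assumptions made for illustration; the paper's actual key structure and skew mitigation are not reproduced here.

```python
from collections import defaultdict

# Hypothetical algorithm tags; the paper's actual key layout may differ.
WORD_COUNT = "wc"
INVERTED_INDEX = "ii"


def in_map(doc_id, text):
    """Emit records for both algorithms from a single read of the input."""
    for word in text.lower().split():
        yield (WORD_COUNT, word), 1
        yield (INVERTED_INDEX, word), doc_id


def shuffle(records):
    """Group values by composite key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def in_reduce(composite_key, values):
    """Dispatch to the per-algorithm reducer based on the tag in the key."""
    algorithm, key = composite_key
    if algorithm == WORD_COUNT:
        return key, sum(values)
    if algorithm == INVERTED_INDEX:
        return key, sorted(set(values))
    raise ValueError(f"unknown algorithm tag: {algorithm}")


def run_packed_job(documents):
    """Run both algorithms over the documents in one simulated job."""
    records = (
        record
        for doc_id, text in documents.items()
        for record in in_map(doc_id, text)
    )
    results = {WORD_COUNT: {}, INVERTED_INDEX: {}}
    for composite_key, values in shuffle(records).items():
        key, output = in_reduce(composite_key, values)
        results[composite_key[0]][key] = output
    return results


docs = {"d1": "big data big cluster", "d2": "data cluster"}
results = run_packed_job(docs)
```

Here `results["wc"]` holds the word counts and `results["ii"]` holds the inverted index, both produced from one pass over `docs`. Tagging each key with its algorithm is what lets a single reduce phase keep the two outputs separate; a real cluster job would additionally need partitioning and load balancing across reducers, which this sketch omits.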