| Literature DB >> 26191487 |
Dilpreet Singh, Chandan K Reddy.
Abstract
The primary purpose of this paper is to provide an in-depth analysis of different platforms available for performing big data analytics. This paper surveys different hardware platforms available for big data analytics and assesses the advantages and drawbacks of each of these platforms based on various metrics such as scalability, data I/O rate, fault tolerance, real-time processing, data size supported and iterative task support. In addition to the hardware, a detailed description of the software frameworks used within each of these platforms is also provided, along with their strengths and drawbacks. Some of the critical characteristics described here can potentially aid the readers in making an informed decision about the right choice of platform depending on their computational needs. Using a star ratings table, a rigorous qualitative comparison between different platforms is also provided for each of the six characteristics that are critical for big data analytics algorithms. To provide more insight into the effectiveness of each platform in the context of big data analytics, implementation-level details of the widely used k-means clustering algorithm on the various platforms are also described in the form of pseudocode.
Keywords: Big data; MapReduce; big data analytics; big data platforms; graphics processing units; k-means clustering; real-time processing; scalability
Year: 2014 PMID: 26191487 PMCID: PMC4505391 DOI: 10.1186/s40537-014-0008-6
Source DB: PubMed Journal: J Big Data ISSN: 2196-1115
A comparison of advantages and drawbacks of horizontal and vertical scaling

| | Advantages | Drawbacks |
|---|---|---|
| Horizontal scaling | ➔ Increases performance in small steps as needed<br>➔ Financial investment to upgrade is relatively small<br>➔ Can scale out the system as much as needed | ➔ Software has to handle all the data distribution and parallel processing complexities<br>➔ Only a limited number of software products can take advantage of horizontal scaling |
| Vertical scaling | ➔ Most software can easily take advantage of vertical scaling<br>➔ Easy to manage and install hardware within a single machine | ➔ Requires substantial financial investment<br>➔ The system has to be sized for future workloads, so the additional performance is initially not fully utilized<br>➔ It is not possible to scale up vertically beyond a certain limit |
Figure 1. Hadoop Stack showing different components.
Figure 2. An illustration of the Berkeley Data Analysis Stack and its various components [20].
Figure 3. A comparison between the architectures of CPU (a) and GPU (b) showing the arrangement of processing cores.
Comparison of different platforms (along with their communication mechanisms) based on various characteristics

| Scaling | Platform (communication scheme) | Scalability | Data I/O rate | Fault tolerance | Real-time processing | Data size supported | Iterative task support |
|---|---|---|---|---|---|---|---|
| Horizontal scaling | Peer-to-Peer (TCP/IP) | ★★★★★ | ★ | ★ | ★ | ★★★★★ | ★★ |
| | Virtual clusters (MapReduce/MPI) | ★★★★★ | ★★ | ★★★★★ | ★★ | ★★★★ | ★★ |
| | Virtual clusters (Spark) | ★★★★★ | ★★★ | ★★★★★ | ★★ | ★★★★ | ★★★ |
| Vertical scaling | HPC clusters (MPI/MapReduce) | ★★★ | ★★★★ | ★★★★ | ★★★ | ★★★★ | ★★★★ |
| | Multicore (Multithreading) | ★★ | ★★★★ | ★★★★ | ★★★ | ★★ | ★★★★ |
| | GPU (CUDA) | ★★ | ★★★★★ | ★★★★ | ★★★★★ | ★★ | ★★★★ |
| | FPGA (HDL) | ★ | ★★★★★ | ★★★★ | ★★★★★ | ★★ | ★★★★ |
Figure 4. The pseudocode of the k-means clustering algorithm.
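The figure itself is not reproduced in this record. As a rough stand-in, here is a minimal pure-Python sketch of the standard Lloyd's-style k-means loop; the function name, signature, and defaults are illustrative, not the paper's pseudocode.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize centroids from the data
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # Update step: move each non-empty centroid to its cluster mean
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = [sum(coord) / len(cl) for coord in zip(*cl)]
    return centroids
```

For example, on two well-separated pairs of 2-D points with `k=2`, the loop settles on the two pair midpoints.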
Figure 5. Pseudocode of the MapReduce-based k-means clustering algorithm. The first part shows the map function and the second part shows the reduce function.
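The map/reduce split described in the caption can be sketched locally in plain Python; the shuffle phase is emulated with a dictionary, and the function names are illustrative stand-ins, not the paper's pseudocode.

```python
def map_point(point, centroids):
    """Map: emit (nearest-centroid index, (point, count=1)) for one point."""
    j = min(range(len(centroids)),
            key=lambda j: sum((a - b) ** 2
                              for a, b in zip(point, centroids[j])))
    return j, (point, 1)

def reduce_cluster(values):
    """Reduce: sum the points and counts for one key; emit the new centroid."""
    dims = len(values[0][0])
    total, count = [0.0] * dims, 0
    for point, c in values:
        count += c
        for d in range(dims):
            total[d] += point[d]
    return tuple(t / count for t in total)

def kmeans_iteration(points, centroids):
    """One MapReduce round: group map outputs by key, reduce each group."""
    groups = {}
    for p in points:
        key, value = map_point(p, centroids)
        groups.setdefault(key, []).append(value)
    return {key: reduce_cluster(vals) for key, vals in groups.items()}
```

The driver program would call `kmeans_iteration` repeatedly, feeding each round's centroids into the next, until the centroids stop moving.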
Figure 6. Pseudocode of the k-means clustering algorithm using MPI in a master–slave configuration.
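The master–slave pattern in the caption can be emulated serially; no actual MPI calls are made, and `split`, `worker_step`, and `master_combine` are illustrative stand-ins for the scatter, local-computation, and reduce phases.

```python
def split(points, n_workers):
    """Master: partition the data among workers (like an MPI scatter)."""
    return [points[i::n_workers] for i in range(n_workers)]

def worker_step(chunk, centroids):
    """Slave: compute per-cluster coordinate sums and counts for its chunk."""
    k, dims = len(centroids), len(centroids[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        j = min(range(k),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[j])))
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += p[d]
    return sums, counts

def master_combine(partials, centroids):
    """Master: merge partial sums (like an MPI reduce), update centroids."""
    k, dims = len(centroids), len(centroids[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for ps, pc in partials:
        for j in range(k):
            counts[j] += pc[j]
            for d in range(dims):
                sums[j][d] += ps[j][d]
    return [tuple(s / counts[j] for s in sums[j]) if counts[j]
            else centroids[j] for j in range(k)]
```

In a real MPI run each `worker_step` would execute on its own rank and only the small sum/count arrays would travel over the network, which is why the per-iteration communication cost is independent of the data size.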
Figure 7. The pseudocode of the k-means clustering algorithm on GPU.