Yunhong Gu, Robert L Grossman.
Abstract
Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply, given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. By contrast with the existing storage and compute clouds, Sector can manage data not only within a data centre, but also across geographically distributed data centres. Similarly, the Sphere compute cloud supports user-defined functions (UDFs) over data both within and across data centres. As a special case, MapReduce-style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe some experimental studies comparing Sector/Sphere and Hadoop using the Terasort benchmark. In these studies, Sector is approximately twice as fast as Hadoop. Sector/Sphere is open source.
Year: 2009; PMID: 19451100; PMCID: PMC3391065; DOI: 10.1098/rsta.2009.0053
Source DB: PubMed; Journal: Philos Trans A Math Phys Eng Sci; ISSN: 1364-503X; Impact factor: 4.226
Figure 1. The Sector system architecture.
Figure 2. The computing paradigm of Sphere.
Figure 3. Sorting large distributed datasets with Sphere.
Figure 4. File downloading performance on the Teraflow Testbed.
Figure 5. Performance of SDSS distribution to end users.
The Terasort benchmark for Sector/Sphere and Hadoop. (All times are in seconds.)
| locations (nodes) | Sector/Sphere | Hadoop three replicas | Hadoop one replica |
|---|---|---|---|
| UIC (one location, 30 nodes) | 1265 | 2889 | 2252 |
| UIC+StarLight (two locations, 60 nodes) | 1361 | 2896 | 2617 |
| UIC+StarLight+Calit2 (three locations, 90 nodes) | 1430 | 4341 | 3069 |
| UIC+StarLight+Calit2+JHU (four locations, 120 nodes) | 1526 | 6675 | 3702 |
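
The "approximately twice as fast" statement in the abstract can be checked against the Terasort times in this table. The following is a minimal sketch of that arithmetic; the names `terasort_seconds` and `speedups` are illustrative and do not come from the paper.

```python
# Terasort times in seconds, copied from the table above.
# Tuple order: (Sector/Sphere, Hadoop three replicas, Hadoop one replica).
terasort_seconds = {
    "UIC (one location, 30 nodes)": (1265, 2889, 2252),
    "UIC+StarLight (two locations, 60 nodes)": (1361, 2896, 2617),
    "UIC+StarLight+Calit2 (three locations, 90 nodes)": (1430, 4341, 3069),
    "UIC+StarLight+Calit2+JHU (four locations, 120 nodes)": (1526, 6675, 3702),
}


def speedups(times):
    """Return, per configuration, the Hadoop time divided by the Sector/Sphere time.

    Each value is a pair: (ratio vs. three replicas, ratio vs. one replica),
    rounded to two decimal places.
    """
    result = {}
    for name, (sector, hadoop3, hadoop1) in times.items():
        result[name] = (round(hadoop3 / sector, 2), round(hadoop1 / sector, 2))
    return result


if __name__ == "__main__":
    for name, (vs_three, vs_one) in speedups(terasort_seconds).items():
        print(f"{name}: {vs_three}x vs three replicas, {vs_one}x vs one replica")
```

With these figures the ratios range from roughly 1.8 (one replica, one location) to roughly 4.4 (three replicas, four locations), so the gap widens as more locations are added.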
A summary of some of the differences between Sector/Sphere and GFS/BigTable and Hadoop.
| design decision | GFS, BigTable | Hadoop | Sector/Sphere |
|---|---|---|---|
| datasets divided into files or into blocks | blocks | blocks | files |
| protocol for message passing within the system | TCP | TCP | group messaging protocol |
| protocol for transferring data | TCP | TCP | UDP-based data transport |
| programming model | MapReduce | MapReduce | user-defined functions applied to segments |
| replication strategy | replicas created at the time of writing | replicas created at the time of writing | replicas created periodically by system |
| support high-volume inflows and outflows | no | no | yes, using UDT |
| security model | not mentioned | none | user-level and file-level access controls |
| language | C++ | Java | C++ |
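
The "user-defined functions applied to segments" programming model, and the abstract's remark that MapReduce is a Map UDF followed by a Reduce UDF, can be illustrated with a small sketch. This is plain single-process Python, not the Sphere API (Sector/Sphere is written in C++). All function names here are hypothetical, and the intermediate `shuffle` step that groups keys into buckets is an assumption added so that the Reduce UDF sees every value for a given key.

```python
from collections import defaultdict


def apply_udf(udf, segments):
    """Apply a user-defined function independently to each data segment."""
    return [udf(segment) for segment in segments]


def map_udf(segment):
    """Map UDF: emit a (word, 1) pair for every word in one segment of text records."""
    return [(word, 1) for record in segment for word in record.split()]


def shuffle(mapped_segments, num_buckets):
    """Assumed intermediate step: route each pair to a bucket chosen by key hash,
    so that all pairs with the same key end up in the same bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for pairs in mapped_segments:
        for key, value in pairs:
            buckets[hash(key) % num_buckets].append((key, value))
    return buckets


def reduce_udf(segment):
    """Reduce UDF: sum the values for each key within one bucket."""
    counts = defaultdict(int)
    for key, value in segment:
        counts[key] += value
    return dict(counts)


def word_count(segments, num_buckets=2):
    """MapReduce-style word count: Map UDF, then bucket by key, then Reduce UDF."""
    mapped = apply_udf(map_udf, segments)
    buckets = shuffle(mapped, num_buckets)
    reduced = apply_udf(reduce_udf, buckets)
    merged = {}
    for part in reduced:
        merged.update(part)  # keys are disjoint across buckets
    return merged


if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"]]))
```

In this sketch each segment is processed without reference to any other segment, which is what would allow the same UDF to run on different nodes or in different data centres.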