Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Zakira Inayat, Waleed Kamaleldin Mahmoud Ali, Muhammad Alam, Muhammad Shiraz, Abdullah Gani.
Abstract
Big Data has gained much attention from academia and the IT industry. In the digital and computing world, information is generated and collected at a rate that rapidly exceeds the boundary range. Currently, over 2 billion people worldwide are connected to the Internet, and over 5 billion individuals own mobile phones. By 2020, 50 billion devices are expected to be connected to the Internet. At that point, predicted data production will be 44 times greater than that in 2009. As information is transferred and shared at light speed over optical fiber and wireless networks, the volume of data and the speed of market growth increase. However, the fast growth rate of such large data generates numerous challenges, such as the rapid growth of data, transfer speed, diverse data, and security. Nonetheless, Big Data is still in its infancy, and the domain has not been reviewed in general. Hence, this study comprehensively surveys and classifies the various attributes of Big Data, including its nature, definitions, rapid growth rate, volume, management, analysis, and security. This study also proposes a data life cycle that uses the technologies and terminologies of Big Data. Future research directions in this field are determined based on opportunities and several open issues in the Big Data domain. These research directions facilitate the exploration of the domain and the development of optimal techniques to address Big Data.
Year: 2014 PMID: 25136682 PMCID: PMC4127205 DOI: 10.1155/2014/712826
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Challenges in Big Data [13].
Rapid growth of unstructured data.
| Source | Production |
|---|---|
| YouTube | Users upload 100 hours of new videos per minute |
| Facebook | Every minute, 34,722 Likes are registered |
| Twitter | The site has over 645 million users |
| Foursquare | This site is used by 45 million people worldwide |
| Google+ | 1 billion accounts have been created |
| Google | The site gets over 2 million search queries per minute |
| Apple | Approximately 47,000 applications are downloaded per minute |
| Brands | More than 34,000 Likes are registered per minute |
| Tumblr | Blog owners publish 27,000 new posts per minute |
| Instagram | Users share 40 million photos per day |
| Flickr | Users upload 3,125 new photos per minute |
| LinkedIn | 2.1 million groups have been created |
| WordPress | Bloggers publish nearly 350 new blogs per minute |
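The table above mixes per-minute and per-day rates, which makes the rows hard to compare directly. The short sketch below converts a few of the per-minute figures to a per-day basis. The dictionary keys and the `per_day` helper are illustrative names, not part of the source, and figures the table gives as "over" or "approximately" are treated as point values for the arithmetic only.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day

# Per-minute rates copied from the table; labels are hypothetical.
PER_MINUTE_RATES = {
    "YouTube hours of video uploaded": 100,
    "Facebook Likes": 34_722,
    "Google search queries": 2_000_000,   # table says "over 2 million"
    "Apple application downloads": 47_000,  # table says "approximately"
    "Tumblr posts": 27_000,
    "Flickr photos": 3_125,
}


def per_day(rate_per_minute):
    """Scale a per-minute rate to a per-day rate."""
    return rate_per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    for label, rate in PER_MINUTE_RATES.items():
        print(f"{label}: {per_day(rate):,} per day")
```

On this basis, for example, 3,125 Flickr photos per minute corresponds to 4.5 million per day, which can be set against the 40 million photos per day reported for Instagram.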
Figure 2: Worldwide shipment of HDDs from 1976 to 2013.
Figure 3: Hadoop ecosystem.
Hadoop components and their functionalities.
| Hadoop component | Functions |
|---|---|
| (1) HDFS | Storage and replication |
| (2) MapReduce | Distributed processing and fault tolerance |
| (3) HBase | Fast read/write access |
| (4) HCatalog | Metadata |
| (5) Pig | Scripting |
| (6) Hive | SQL |
| (7) Oozie | Workflow and scheduling |
| (8) ZooKeeper | Coordination |
| (9) Kafka | Messaging and data integration |
| (10) Mahout | Machine learning |
Hadoop usage.
| Specified use | Used by |
|---|---|
| (1) Searching | Yahoo, Amazon, Zvents |
| (2) Log processing | Facebook, Yahoo, ContextWeb, Joost, Last.fm |
| (3) Analysis of videos and images | New York Times, Eyelike |
| (4) Data warehouse | Facebook, AOL |
| (5) Recommendation systems | |
Figure 4: System architectures of MapReduce and HDFS.
MapReduce tasks.
| Steps | Tasks |
|---|---|
| (1) Input | Data are loaded into HDFS in blocks and distributed to data nodes |
| (2) Job | Submits the job and its details to the Job Tracker |
| (3) Job initialization | The Job Tracker interacts with the Task Tracker on each data node |
| (4) Mapping | The Mapper processes the data blocks |
| (5) Sorting | The Mapper sorts the list of key-value pairs |
| (6) Shuffling | The mapped output is transferred to the Reducers |
| (7) Reduction | Reducers merge the list of key-value pairs to generate the final result |
| (8) Result | Values are stored in HDFS |
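The data-flow steps in the table above (input, mapping, sorting, shuffling, reduction, result) can be illustrated with a minimal single-process sketch. This is not Hadoop code: there is no HDFS, Job Tracker, or Task Tracker here, the word-count job is an assumed example, and every function name (`split_into_blocks`, `mapper`, `shuffle`, `reducer`, `run_job`) is hypothetical.

```python
import zlib
from collections import defaultdict
from operator import itemgetter


def split_into_blocks(lines, block_size):
    """Input: cut the data into fixed-size blocks (a stand-in for HDFS blocks)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def mapper(block):
    """Mapping and sorting: emit (word, 1) pairs for one block, sorted by key."""
    pairs = [(word.lower(), 1) for line in block for word in line.split()]
    return sorted(pairs, key=itemgetter(0))


def shuffle(mapped_outputs, num_reducers):
    """Shuffling: send every pair with the same key to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            # CRC32 gives a partition index that is stable across runs.
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reducer(partition):
    """Reduction: merge the list of values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}


def run_job(lines, block_size=2, num_reducers=2):
    """Run all steps in order and return the merged result as a dict."""
    blocks = split_into_blocks(lines, block_size)
    mapped = [mapper(block) for block in blocks]
    partitions = shuffle(mapped, num_reducers)
    result = {}
    for partition in partitions:
        result.update(reducer(partition))  # stand-in for writing to HDFS
    return dict(sorted(result.items()))


if __name__ == "__main__":
    print(run_job(["big data big", "data hadoop", "big hadoop"]))
```

In a real cluster each mapper and reducer would run on a separate node and the shuffle would move data over the network; here they are plain function calls so that the order of the steps is easy to follow.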
Figure 5: MapReduce architecture.
Figure 6: Proposed data life cycle using the technologies and terminologies of Big Data.
Structured versus unstructured data.
| | Structured data | Unstructured data |
|---|---|---|
| Format | Rows and columns | Binary large objects |
| Storage | Database Management Systems (DBMS) | Unmanaged documents and unstructured files |
| Metadata | Syntax | Semantics |
| Integration tools | Traditional Data Mining (ETL) | Batch processing |
DoS attack approaches.
| Defense strategy | Objectives | Pros | Cons |
|---|---|---|---|
| Defense against the new DoS attack | Detects the new type of DoS | Prevents the bandwidth degradation | Unavailability of the service during application migration |
| FRC attack detection | Detects the FRC attack | No bandwidth wastage | Cannot always identify the attacker |