Emad A Mohammed, Behrouz H Far, Christopher Naugler.
Abstract
The emergence of massive datasets in a clinical setting presents both challenges and opportunities in data storage and analysis. This so-called "big data" challenges traditional analytic tools and will increasingly require novel solutions adapted from other fields. Advances in information and communication technology present the most viable solutions to big data analysis in terms of efficiency and scalability. It is vital that these big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. The MapReduce programming framework uses two tasks common in functional programming: Map and Reduce. MapReduce is a new parallel processing framework, and Hadoop is its open-source implementation, which runs on a single computing node or on clusters. Compared with existing parallel processing paradigms (e.g. grid computing and graphics processing unit (GPU) computing), MapReduce and Hadoop have two advantages: 1) fault-tolerant storage, which yields reliable data processing by replicating the computing tasks and cloning the data chunks on different computing nodes across the computing cluster; 2) high-throughput data processing via a batch processing framework and the Hadoop distributed file system (HDFS). Data are stored in the HDFS and made available to the slave nodes for computation. In this paper, we review the existing applications of the MapReduce programming framework and its implementation platform Hadoop in clinical big data and related medical health informatics fields. The use of MapReduce and Hadoop on a distributed system represents a significant advance in clinical big data processing and utilization, and opens up new opportunities in the emerging era of big data analytics. The objective of this paper is to summarize the state-of-the-art efforts in clinical big data analytics and highlight what might be needed to enhance the outcomes of clinical big data analytics tools.
The paper concludes by summarizing the potential usage of the MapReduce programming framework and the Hadoop platform to process huge volumes of clinical data in medical health informatics related fields.
Keywords: Big data; Bioinformatics; Clinical big data analysis; Clinical data analysis; Distributed programming; Hadoop; MapReduce
Year: 2014 PMID: 25383096 PMCID: PMC4224309 DOI: 10.1186/1756-0381-7-22
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. The architecture of the Hadoop cluster. The cluster consists of distributed computing nodes, namely the master node (NameNode) and the slave nodes (DataNodes), connected through an Ethernet switch.
Figure 2. The MapReduce algorithm workflow for the WordCount problem. The algorithm counts the number of occurrences of every word in the file. The file is split into chunks that are distributed over the computing nodes in the cluster. The map phase must complete before the reduce phase starts; otherwise, an error is reported and execution stops.
Figure 3. The Hadoop ecosystem. The figure shows the Hadoop system core, its components (ecosystem projects), associated technologies, and a short list of the distributions available from vendors.
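The WordCount workflow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the function names (`split_into_chunks`, `mapper`, `shuffle`, `reducer`, `word_count`) and the round-robin chunking are hypothetical stand-ins for what the framework does across cluster nodes.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(lines, n_chunks):
    """Distribute lines round-robin over n_chunks (stand-in for file blocks)."""
    chunks = [[] for _ in range(n_chunks)]
    for i, line in enumerate(lines):
        chunks[i % n_chunks].append(line)
    return chunks


def mapper(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_pairs):
    """Group emitted values by key, as happens between the two phases."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups


def reducer(word, counts):
    """Reduce task: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines, n_chunks=3):
    """Run the whole pipeline: chunk, map, shuffle, reduce."""
    chunks = split_into_chunks(lines, n_chunks)
    # All map tasks finish here, before any reduce task starts.
    mapped = list(chain.from_iterable(mapper(chunk) for chunk in chunks))
    grouped = shuffle(mapped)
    return dict(reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    text = ["big data big clusters", "data nodes store data"]
    print(word_count(text))
```

In a real cluster the map tasks run in parallel on the nodes that hold each chunk; the sketch only mirrors the ordering constraint that reduction begins after mapping has completed.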
Basic features of 14 Hadoop distributions and related download links
| Vendor | Basic features | Download link |
| Amazon Web Services Inc | • Amazon Elastic Block Store | |
| | • Amazon Virtual Private Cloud | |
| | • GPU Instances | |
| | • High Performance Computing (HPC) Cluster | |
| IBM Corp | • Social and Machine Data Analytics Accelerator | |
| | • Provides a workload scheduler | |
| | • Includes Jaql, a declarative query language. | |
| | • Allows executing R jobs directly from the BigInsights web console. | |
| Pivotal Corp | • A Fast, Proven SQL Database Engine for Hadoop | |
| | • Enterprise Real-Time Data Service on Hadoop | |
| | • Familiar SQL Interface | |
| | • Hadoop In the Cloud: Pivotal HD Virtualized by VMware | |
| Cloudera Inc | • HDFS Snapshots | |
| | • Support for running Hadoop on Microsoft Windows | |
| | • YARN API stabilization | |
| | • Binary Compatibility for MapReduce applications built on hadoop-1.x | |
| MapR Technologies Inc | • Finish small jobs quickly with MapR ExpressLane | |
| | • Enable atomic, consistent point-in-time recovery with MapR Snapshots | |
| Hortonworks Inc | • Use rich business intelligence (BI) tools such as Microsoft Excel, PowerPivot for Excel and Power View | |
| | • HDP for Windows is the ONLY Hadoop distribution available for Windows Server. | |
| Karmasphere Inc | • Ability to Use Existing SAS, SPSS and R Analytic Models | |
| Hadapt Inc | • Analyze both structured and unstructured data in a single, unified platform | |
| Super Micro Computer Inc | • Fully-validated, pre-configured SKUs optimized for Hadoop solutions | |
| Pentaho Corp | • Visual development for Hadoop data preparation and modeling | |
| Zettaset Inc | • Enterprise-Grade Hadoop Cluster Management | |
| Datastax Inc | • Powered by Apache Cassandra™, Certified for Production | |
| Datameer Inc | • Data Integration, Analytics, and Visualization | |
| Dell Inc | • Cloudera distribution for Hadoop | |
Description of the Hadoop related projects/ecosystems
| Project | Description |
| Avro | • Avro is a framework for performing remote procedure calls and data serialization. |
| Flume | • Flume is a tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop. |
| HBase | • Based on Google’s Bigtable, HBase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive datasets. |
| HCatalog | • An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. |
| Hive | • Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources. |
| Mahout | • Mahout is a scalable machine-learning and data mining library. |
| Oozie | • Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. |
| Pig | • Pig is a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster. |
| Sqoop | • Sqoop (SQL-to-Hadoop) is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase. |
| ZooKeeper | • ZooKeeper is a service for maintaining configuration information, naming, providing distributed synchronization and providing group services. |
| YARN | • YARN is a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications. |
| Cascading | • Cascading is an alternative API to Hadoop MapReduce. Cascading now has support for reading and writing data to and from an HBase cluster. |
| Twitter Storm | • Twitter Storm is a free and open source distributed real time computation system. |
| High performance computing cluster (HPCC) | • HPCC is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. |
| Dremel | • Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. |
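The HBase entry above notes that a column-based layout speeds up operations over similar values. A minimal sketch of that general idea follows, using plain Python structures rather than the HBase API or its actual storage format; the record fields (`patient_id`, `age`, `glucose`), their values, and all function names are made up for illustration.

```python
# Illustrative only: plain Python data, not HBase calls or real patient data.
ROWS = [
    {"patient_id": "p1", "age": 54, "glucose": 5.0},
    {"patient_id": "p2", "age": 61, "glucose": 6.0},
    {"patient_id": "p3", "age": 47, "glucose": 7.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def mean_row_layout(rows, field):
    """Row layout: every full record is visited to read a single field."""
    return sum(row[field] for row in rows) / len(rows)


def mean_column_layout(columns, field):
    """Column layout: only the values of the requested field are read."""
    values = columns[field]
    return sum(values) / len(values)


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(mean_row_layout(ROWS, "glucose"))
    print(mean_column_layout(columns, "glucose"))
```

Both functions return the same mean; the difference is that the column layout touches only the values of one field, which is the property the table entry attributes to column-based stores.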
Summary of reviewed research in clinical big data analysis using the MapReduce programming model
| Study Category | Study Name/Reference | Study year | Technology used | Application |
| Public database | A drug-adverse event extraction algorithm to support pharmacovigilance knowledge mining from PubMed citations/[ | 2011 | A MapReduce based algorithm for common adverse drug events (ADE) detection | Biomedical data mining |
| | Identifying unproven cancer treatments on the health web: Addressing accuracy, generalizability and scalability/[ | 2012 | Using MapReduce and Markov boundary feature selection | Identify unproven cancer treatments on the health web |
| | A user-friendly tool to transform large scale administrative data into wide table format using a MapReduce program with a pig latin based script/[ | 2012 | MapReduce and Pig Latin | Administrative data management |
| Biometric | Leveraging the cloud for big data biometrics: Meeting the performance requirements of the next generation biometric systems/[ | 2011 | MapReduce machine learning algorithms for image recognition on the Hadoop platform | Design of security systems using biometric identification |
| | Iris recognition on hadoop: A biometrics system implementation on cloud computing/[ | 2011 | Human iris MapReduce search algorithm on the cloud | Data retrieval and security system |
| | Cloud-ready biometric system for mobile security access/[ | 2012 | MapReduce algorithm for capture and recognition of biometric information | Biometric-identification mobile phone applications |
| Genome and Protein data analysis | Parallelizing bioinformatics applications with MapReduce/[ | 2008 | MapReduce algorithms | Bioinformatics applications |
| | Cloudblast: Combining MapReduce and virtualization on distributed resources for bioinformatics applications/[ | 2008 | Cloud/MapReduce | Bioinformatics applications |
| | CloudBurst: highly sensitive read mapping with MapReduce/[ | 2009 | MapReduce algorithms | Genome sequence mapping tool |
| | Cloud technologies for bioinformatics applications/[ | 2009 | Cloud/MapReduce | Bioinformatics applications |
| | The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data/[ | 2010 | HBase for data management and MapReduce jobs for computation | Genome sequence comparison application |
| | Nephele: genotyping via complete composition vectors and MapReduce/[ | 2011 | MapReduce algorithms | Genotyping sequence tool |
| | A graphical execution platform for MapReduce programs on private and public clouds/[ | 2012 | Cloud/MapReduce | Bioinformatics applications |
| | Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework/[ | 2012 | MapReduce algorithms | Bioinformatics applications |
| | An efficient algorithm for DNA fragment assembly in MapReduce/[ | 2012 | MapReduce algorithm for DNA fragment assembly | A tool for DNA fragment assembly |
| | De novo assembly of high-throughput sequencing data with cloud computing and new operations on string graphs/[ | 2012 | String graph based on the MapReduce algorithms | Distributed genome assembler |
| | Fractal MapReduce decomposition of sequence alignment/[ | 2012 | MapReduce algorithms | Genome sequence alignment tool |
| | Genotyping in the cloud with crossbow/[ | 2012 | Cloud | Genotyping application |
| | BioPig: A hadoop-based analytic toolkit for large-scale sequence data [ | 2013 | MapReduce algorithms | Bioinformatics processing tool known as BioPig |
| | Implementation of a parallel protein structure alignment service on cloud/[ | 2013 | MapReduce alignment algorithm | Protein alignment application |
| | BlueSNP: R package for highly scalable genome-wide association studies using hadoop clusters/[ | 2013 | R algorithms executed on top of the Hadoop platform | Statistical package in R for genome analysis |
| | Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce/[ | 2013 | MapReduce algorithms | Enhanced algorithm |
| | Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing/[ | 2013 | Cloud | Whole-genome sequencing |
| | Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes/[ | 2013 | MapReduce algorithms | Multivariate neuroimaging phenotypes |
| | Novel and efficient tag SNPs selection algorithms/[ | 2014 | MapReduce algorithm for efficient selection of SNPs | Genome analysis |
| | Designing a parallel evolutionary algorithm for inferring gene networks on the cloud computing environment/[ | 2014 | Cloud | Algorithm for inferring gene networks |
| | Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline/[ | 2014 | Cloud | Sequence analysis application |
| Biomedical signal analysis | HBase, MapReduce, and integrated data visualization for processing clinical signal data/[ | 2011 | HBase for data management and MapReduce processing algorithm | Storage and processing of clinical signals |
| | Parallel processing of massive EEG data with MapReduce/[ | 2012 | MapReduce EEMD algorithm | Massive biomedical signal processing |
| Biomedical image analysis | Hadoop-gis: A high performance query system for analytical medical imaging with MapReduce/[ | 2011 | HBase for data management and MapReduce processing algorithm | Storage and processing of medical images |
| | Ultrafast and scalable cone-beam CT reconstruction using MapReduce in a cloud computing environment [ | 2011 | MapReduce image processing algorithms on the cloud | Acceleration of the FDK algorithm for cone-beam CT |
| | Using MapReduce for Large-Scale Medical Image Analysis/[ | 2012 | MapReduce algorithm | Medical image analysis |
The summary includes information related to each study (i.e. category, name, year, technology used, experiment design and potential applications).