Wullianallur Raghupathi, Viju Raghupathi.
Abstract
OBJECTIVE: To describe the promise and potential of big data analytics in healthcare.
Keywords: Analytics; Big data; Framework; Hadoop; Healthcare; Methodology
Year: 2014 PMID: 25825667 PMCID: PMC4341817 DOI: 10.1186/2047-2501-2-3
Source DB: PubMed Journal: Health Inf Sci Syst ISSN: 2047-2501
Figure 1. An applied conceptual architecture of big data analytics.
Platforms & tools for big data analytics in healthcare
| Platform/Tool | Description |
|---|---|
| The Hadoop Distributed File System (HDFS) | HDFS provides the underlying storage for the Hadoop cluster. It divides the data into smaller parts and distributes them across the various servers/nodes. |
| MapReduce | MapReduce provides the interface for the distribution of sub-tasks and the gathering of outputs. When tasks are executed, MapReduce tracks the processing of each server/node. |
| PIG and PIG Latin (Pig and PigLatin) | The Pig programming language is configured to assimilate all types of data (structured/unstructured, etc.). It comprises two key modules: the language itself, called PigLatin, and the runtime environment in which the PigLatin code is executed. |
| Hive | Hive is a runtime Hadoop support architecture that leverages Structured Query Language (SQL) with the Hadoop platform. It permits SQL programmers to develop Hive Query Language (HQL) statements akin to typical SQL statements. |
| Jaql | Jaql is a functional, declarative query language designed to process large data sets. To facilitate parallel processing, Jaql converts “‘high-level’ queries into ‘low-level’ queries” consisting of MapReduce tasks. |
| Zookeeper | Zookeeper allows a centralized infrastructure with various services, providing synchronization across a cluster of servers. Big data analytics applications utilize these services to coordinate parallel processing across big clusters. |
| HBase | HBase is a column-oriented database management system that sits on top of HDFS. It uses a non-SQL approach. |
| Cassandra | Cassandra is also a distributed database system. It is designated as a top-level project modeled to handle big data distributed across many utility servers. It also provides reliable service with no particular point of failure. |
| Oozie | Oozie, an open source project, streamlines the workflow and coordination among the tasks. |
| Lucene | The Lucene project is used widely for text analytics/searches and has been incorporated into several open source projects. Its scope includes full text indexing and library search for use within a Java application. |
| Avro | Avro facilitates data serialization services. Versioning and version control are additional useful features. |
| Mahout | Mahout is yet another Apache project whose goal is to generate free applications of distributed and scalable machine learning algorithms that support big data analytics on the Hadoop platform. |
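The division of labor between HDFS and MapReduce described in the table can be illustrated with a minimal, single-process Python sketch. It is not Hadoop code and does not use any Hadoop API: the function names (`split_into_blocks`, `map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the three sample notes are hypothetical, chosen only to show data being divided into parts, sub-tasks being applied to each part, and the outputs being gathered.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(records, block_size):
    """Divide records into fixed-size parts, loosely mimicking how HDFS splits data."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_phase(block):
    """Sub-task run on one block: emit a (term, 1) pair for every term."""
    return [(term.lower(), 1) for record in block for term in record.split()]


def shuffle(mapped_pairs):
    """Group emitted values by key before they are handed to the reducer."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Gather the outputs: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, block_size=2):
    """Run the whole toy job in one process (a real cluster runs blocks in parallel)."""
    blocks = split_into_blocks(records, block_size)
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Made-up sample text, for illustration only.
    notes = ["fever cough", "cough fatigue", "fever"]
    print(run_job(notes))
```

In an actual Hadoop deployment the blocks would reside on different nodes and the framework would schedule and track the map and reduce tasks; here everything runs sequentially so the data flow is easy to follow.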
Outline of big data analytics in healthcare methodology
| Step | Stage | Activities |
|---|---|---|
| Step 1 | Concept statement | Establish need for big data analytics project in healthcare based on the “4Vs”. |
| Step 2 | Proposal | What is the problem being addressed? Why is it important and interesting? Why big data analytics approach? Background material. |
| Step 3 | Methodology | Propositions; variable selection; data collection; ETL and data transformation; platform/tool selection; conceptual model; analytic techniques (association, clustering, classification, etc.); results & insight. |
| Step 4 | Deployment | Evaluation & validation; testing. |
Source: Adapted from Raghupathi & Raghupathi [9].
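Two of the Step 3 activities, ETL/data transformation and an association-style analytic technique, can be sketched in a few lines of Python. This is an illustrative assumption about what such a step might look like, not the method used in the source: the helper names `transform` and `pair_support` are hypothetical, the sample records are made up, and pairwise co-occurrence support stands in for a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations


def transform(raw_records):
    """ETL-style cleanup: trim whitespace, lowercase, drop blanks and duplicates."""
    cleaned = []
    for record in raw_records:
        items = {item.strip().lower() for item in record if item and item.strip()}
        if items:  # skip records that are empty after cleaning
            cleaned.append(items)
    return cleaned


def pair_support(records):
    """Return, for each pair of items, the fraction of records containing both."""
    counts = Counter()
    for items in records:
        counts.update(combinations(sorted(items), 2))
    total = len(records)
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    # Made-up condition lists per patient record, for illustration only.
    raw = [
        ["Diabetes ", "hypertension"],
        ["diabetes", "Hypertension", "asthma"],
        ["asthma", ""],
        [],
    ]
    for pair, support in sorted(pair_support(transform(raw)).items()):
        print(pair, round(support, 2))
```

Pairs with high support would be candidates for the "results & insight" activity; a production project would run the same logic on a platform selected in the "platform/tool selection" activity rather than in memory.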