Dillon Chrimes, Hamid Zamani.
Abstract
Big data analytics (BDA) is important for reducing healthcare costs. However, there are many challenges in data aggregation, maintenance, integration, translation, analysis, and security/privacy. The study objective, to establish an interactive BDA platform with simulated patient data using open-source software technologies, was achieved by constructing a platform framework on the Hadoop Distributed File System (HDFS) using HBase (a key-value NoSQL database). Distributed data structures were generated from benchmarked hospital-specific metadata of nine billion patient records. At optimized iteration, HDFS ingestion of HFiles to HBase store files revealed sustained availability over hundreds of iterations; however, completing MapReduce to HBase required a week for 10 TB and a month for three billion (30 TB) indexed patient records. Inconsistencies found in MapReduce limited the capacity to generate and replicate data efficiently. Apache Spark and Drill showed high performance with high usability for technical support but poor usability for clinical services. A hospital system based on patient-centric data was challenging to implement in HBase, in that not all data profiles were fully integrated with the complex patient-to-hospital relationships. However, we recommend using HBase to keep patient data secure while querying entire hospital volumes in a simplified clinical event model across clinical services.
Year: 2017 PMID: 29375652 PMCID: PMC5742497 DOI: 10.1155/2017/6120820
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
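The ingestion durations quoted in the abstract (a week for 10 TB, a month for 30 TB) can be converted into a rough sustained throughput to make them easier to compare. The sketch below is only back-of-envelope arithmetic, not a figure reported by the authors. It assumes decimal terabytes, a 30-day month, and uninterrupted operation; the function name `throughput_mb_per_s` is hypothetical.

```python
# Back-of-envelope throughput implied by the abstract's ingestion times.
# Assumptions (not from the paper): 1 TB = 10**12 bytes, a month = 30 days,
# and the job runs continuously for the whole period.

BYTES_PER_TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def throughput_mb_per_s(terabytes: float, days: float) -> float:
    """Average throughput in MB/s for moving `terabytes` in `days`."""
    total_bytes = terabytes * BYTES_PER_TB
    total_seconds = days * SECONDS_PER_DAY
    return total_bytes / total_seconds / 10**6


week_rate = throughput_mb_per_s(10, 7)    # 10 TB in one week
month_rate = throughput_mb_per_s(30, 30)  # 30 TB in one month

print(f"10 TB in a week:  {week_rate:.1f} MB/s")   # 16.5 MB/s
print(f"30 TB in a month: {month_rate:.1f} MB/s")  # 11.6 MB/s
```

Under these assumptions the larger run is slower per byte than the smaller one, which is consistent with the abstract's remark that MapReduce limited the capacity to generate and replicate data efficiently.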
Big Data applications related to clinical services [11–13, 18].
| Clinical services | Healthcare applications |
|---|---|
| R&D | (i) Targeted R&D pipeline in drugs and devices, clinical trial design, and patient recruitment to better match treatments to individual patients, thus reducing trial failures, speeding new treatments to market, following on indications, and discovering adverse effects before products reach the market |
| Public health | (i) Targeted vaccines, e.g., choosing the annual influenza strains |
| Evidence-based medicine | (i) Combine and analyze a variety of structured and unstructured data (EMRs, financial and operational data, clinical data, and genomic data) to match treatments with outcomes, predict patients at risk for disease or readmission, and provide more efficient care |
| Genomic analytics | (i) Make genomic analysis a part of the regular medical care decision process and the growing patient medical record |
| Device/remote monitors | (i) Capture and analyze in real time large volumes of fast-moving data from in-hospital and in-home devices, for safety monitoring and adverse event prediction |
| Patient profile analytics | (i) Identify individuals who would benefit from proactive care or lifestyle changes, for example, those patients at risk of developing a specific disease (e.g., diabetes) who would benefit from preventive care |
Big Data technologies using Hadoop with possible applications in healthcare [5, 7–9, 11–13, 29, 37–42].
| Technologies | Clinical utilization |
|---|---|
| Hadoop Distributed File System (HDFS) | Used clinically for its high-capacity, fault-tolerant, and inexpensive storage of very large clinical datasets. |
| MapReduce | The programming paradigm has been used for processing clinical Big Data. |
| Hadoop | Infrastructure adapted for clinical data processing. |
| Spark | Used indirectly for processing/storage of clinical data. |
| Cassandra | Key-value store used indirectly for clinical data. |
| HBase | NoSQL database with random access; has been used for clinical data. |
| Apache Solr | Document warehouse used indirectly for clinical data. |
| Lucene and Blur | Document warehouse not yet used in healthcare but upcoming for free-text queries on the Hadoop platform; can be used for clinical data. |
| MongoDB | JSON document-oriented database; has been used for clinical data. |
| Hive | Data interaction not yet configured for clinical data, but a cross-platform SQL layer is possible. |
| Spark SQL | SQL access to Hadoop data not yet configured for clinical data. |
| JSON | Data description and transfer format; has been used for clinical data. |
| ZooKeeper | Coordination of data flow; has been used for clinical data. |
| YARN | Resource allocator for data flow; has been used for clinical data. |
| Oozie | Workflow scheduler for managing complex multipart Hadoop jobs; not currently used for clinical data. |
| Pig | High-level data flow language for processing batches of data; not used for clinical data. |
| Storm | Streaming ingestion has been used for clinical data. |
Box 1: Information from interviewed groups involved in clinical reporting at Vancouver Island Health Authority (VIHA).
Use cases and patient encounter scenarios related to metadata of patient visits and its database placement related to query output.
| Case | Clinical Database |
|---|---|
| Uncontrolled type 2 diabetes & complex comorbidities | (i) DAD with diagnosis codes, HBase for IDs |
| TB of the lung & uncontrolled DM 2 | (i) DAD and ADT columns with HBase for patient IDs |
| A on C renal failure, fracture, heart failure to CCU, and stable DM 2 | (i) DAD and ADT columns with HBase for patient IDs |
| Multilocation cancer patient on Palliative | (i) DAD and ADT columns with HBase integrating data together |
| 1 cardiac with complications | (i) DAD and ADT columns with HBase integrating data together |
| 1 ER to surgical, fracture, readmitted category for 7 days and some complication after | (i) DAD and ADT columns with HBase integrating data together |
| 1 simple day-surg. with complication, admitted to inpatient (allergy to medication) | (i) DAD and ADT columns with HBase for patient IDs |
| 1 cardiac with complications and death | (i) DAD and ADT columns with HBase integrating data together |
| 1 normal birth with postpartum hemorrhage complication | (i) DAD and ADT columns with HBase integrating data together |
| 1 HIV/AIDS patient treated for an infection | (i) DAD and ADT columns with HBase for patient IDs |
| Strep A infection | (i) DAD and ADT columns with HBase integrating data together |
| Cold but negative Strep A. Child | (i) DAD and ADT columns with HBase integrating data together |
| Adult patient with Strep A. positive | (i) DAD and ADT columns with HBase for patient IDs |
| Severe pharyngitis | (i) DAD and ADT columns with HBase integrating data together |
| Child, moderate pharyngitis, throat culture negative, physical exam | (i) DAD and ADT columns with HBase for patient IDs |
| Adult, history of heart disease, positive culture for Strep A. | (i) DAD and ADT columns with HBase integrating data together |
| Adult, physical exam, moderate pharyngitis, positive for strep A. culture and positive second time, readmitted | (i) DAD and ADT columns with HBase for patient IDs |
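The use cases above place DAD and ADT columns alongside patient IDs held in HBase. One way to picture that layout is as a sorted map from row key to column family to qualifier. The sketch below is a plain-Python stand-in, not the authors' schema and not the HBase client API: the composite row key, the qualifier names, and the values are all hypothetical placeholders. It relies only on two general HBase properties, namely that rows are stored sorted by key and that columns are grouped into families.

```python
from typing import Dict, Iterator, Tuple

# row key -> column family -> qualifier -> value
Row = Dict[str, Dict[str, str]]
table: Dict[bytes, Row] = {}


def make_row_key(patient_id: str, encounter_id: str) -> bytes:
    """Hypothetical composite key: the patient ID comes first so that all
    encounters of one patient sort next to each other."""
    return f"{patient_id}|{encounter_id}".encode("utf-8")


def put(row_key: bytes, family: str, qualifier: str, value: str) -> None:
    """Store one cell under the given column family and qualifier."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def scan_prefix(prefix: bytes) -> Iterator[Tuple[bytes, Row]]:
    """Yield rows in key order whose key starts with `prefix`,
    mimicking a prefix scan over sorted row keys."""
    for key in sorted(table):
        if key.startswith(prefix):
            yield key, table[key]


# Placeholder data: two encounters for one patient, one for another.
k1 = make_row_key("P0001", "E01")
put(k1, "DAD", "diagnosis_code", "DX-A")     # discharge abstract data
put(k1, "ADT", "admit_location", "WARD-1")   # admission/discharge/transfer data

k2 = make_row_key("P0001", "E02")
put(k2, "DAD", "diagnosis_code", "DX-B")

k3 = make_row_key("P0002", "E01")
put(k3, "ADT", "admit_location", "ER")

for key, row in scan_prefix(b"P0001|"):
    print(key.decode(), row)
```

Including the separator in the scan prefix (`P0001|`) keeps the scan from also matching longer IDs that merely begin with the same characters.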
Figure 1: Big Data Analytics (BDA) platform designed and constructed as patient encounter database of hospital system.
Box 2: Configuration and command scripts run across BDA platform.
Figure 2: Performance (seconds) of 60 ingestions (i.e., 20 replicated 3 times) from Hadoop HDFS to HBase files, MapReduce indexing, and query results. The dashed line is the total ingestion time and the dotted line is the time to complete the Reducer of MapReduce. The bottom dash-dot lines are the times to complete the Map of MapReduce and the duration (seconds) to run the queries.
Operational experiences, persistent issues, and overall limitations of tested Big Data technologies and components that impacted Big Data Analytics (BDA) platform.
| Technology component | Clinical impact to platform |
|---|---|
| Hadoop Distributed File System (HDFS) | (i) Did not reconfigure more than 6 nodes because it is very difficult to maintain clinical data |
| MapReduce | (i) Totally failed ingestion |
| HBase | (i) |
| ZooKeeper & YARN | (i) Extremely slow performance when ZooKeeper services are not running properly for both, but additional configuration minimized this limitation with few issues for YARN |
| Phoenix | (i) Requires maintaining a database schema with current names in a file on the nodes, so that ingested files that do not match raise an error, and verifying that ingested data exist within the schema metadata while running queries |
| Spark | (i) Slow performance |
| Zeppelin | (i) 30-minute delay before running queries, which then take the same amount of time as with Jupyter |
| Jupyter | (i) Once Java is established, it has high usability and excellent performance |
| Drill | (i) It is extremely fast but has poor usability |
Figure 3: A year of varied iteration and CPU usage (at 100%) on Hemes89 node reported from WestGrid showing variation in the duration of the ingestion of 50 million records over each of the iterations. The graph shows the following: user (in red), system (in green), IOWait time (in blue), and CPU Max (black line).
Figure 4: Zeppelin interface with Apache Spark with multiple notebooks that can be selected by clinical users.
Figure 5: Spark with Jupyter and SQL-like script to run all queries in sequence and simultaneously.
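The Figure 5 caption mentions running the same set of SQL-like queries both in sequence and simultaneously. The sketch below illustrates that pattern only. It swaps in Python's built-in `sqlite3` and a thread pool in place of Spark, and the table name, columns, rows, and queries are synthetic placeholders rather than anything taken from the platform.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Synthetic stand-in data; not derived from the study's records.
ROWS = [
    ("P0001", "DX-A", 3),
    ("P0001", "DX-B", 7),
    ("P0002", "DX-A", 1),
]

# Hypothetical SQL-like queries, each with a deterministic row order.
QUERIES = {
    "encounters_per_patient": (
        "SELECT patient_id, COUNT(*) FROM encounters "
        "GROUP BY patient_id ORDER BY patient_id"
    ),
    "total_stay_per_diagnosis": (
        "SELECT diagnosis, SUM(los_days) FROM encounters "
        "GROUP BY diagnosis ORDER BY diagnosis"
    ),
}


def build_db(path: str) -> None:
    """Create and populate the placeholder encounters table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE encounters (patient_id TEXT, diagnosis TEXT, los_days INTEGER)"
    )
    conn.executemany("INSERT INTO encounters VALUES (?, ?, ?)", ROWS)
    conn.commit()
    conn.close()


def run_query(path: str, sql: str) -> list:
    """Open a fresh connection per call so each thread has its own."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def run_sequential(path: str) -> dict:
    """Run every query one after another."""
    return {name: run_query(path, sql) for name, sql in QUERIES.items()}


def run_concurrent(path: str) -> dict:
    """Submit every query at once and collect the results by name."""
    with ThreadPoolExecutor(max_workers=len(QUERIES)) as pool:
        futures = {n: pool.submit(run_query, path, s) for n, s in QUERIES.items()}
        return {name: fut.result() for name, fut in futures.items()}


fd, db_path = tempfile.mkstemp(suffix=".db")
os.close(fd)
build_db(db_path)
sequential = run_sequential(db_path)
concurrent = run_concurrent(db_path)
os.remove(db_path)

print(sequential == concurrent)
```

Each worker opens its own connection because a `sqlite3` connection is, by default, tied to the thread that created it. On a real cluster the scheduling would be handled by the query engine itself.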
Figure 6: Drill interface customized using the distributed mode of Drill with local host and running queries over WestGrid and Hadoop.