Hiromasa Horiguchi, Hideo Yasunaga, Hideki Hashimoto, Kazuhiko Ohe.
Abstract
BACKGROUND: Secondary use of large-scale administrative data is increasingly popular in health services and clinical research, where user-friendly tools for data management are in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large-scale data into wide table format, in which each subject is represented by one row, for use in health services and clinical research. Since the original specification of Pig provides very few functions for column field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to handle processing that depends on date conditions.
Year: 2012 PMID: 23259862 PMCID: PMC3545829 DOI: 10.1186/1472-6947-12-151
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Figure 1. Data transformation from log data to a wide table.
Figure 2. MapReduce architecture.
Figure 3. Step-by-step example of data transformation. $N indicates the Nth column in the data field.
Figure 4. Step-by-step example of date management. $N indicates the Nth column in the data field.
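The log-to-wide-table transformation named in the captions above can be illustrated outside Pig with a minimal Python sketch. This is not the authors' GroupFilterFormat implementation: the record layout (patient ID, event date, item code, value), the sample rows, and the function name `to_wide` are all assumptions made for illustration. The sketch groups log rows by subject, keeps only rows that satisfy a date condition, and formats the result as one row per subject.

```python
from collections import defaultdict
from datetime import date

# Hypothetical log records: (patient_id, event_date, item_code, value).
LOG = [
    ("P001", date(2010, 4, 1), "drug_A", 2),
    ("P001", date(2010, 4, 3), "drug_B", 1),
    ("P001", date(2010, 5, 20), "drug_A", 4),
    ("P002", date(2010, 4, 2), "drug_A", 3),
]


def to_wide(records, columns, start, end):
    """Group by subject, filter by date, and format as one row per subject."""
    totals = defaultdict(lambda: defaultdict(int))
    for patient_id, event_date, item_code, value in records:
        # Date condition: keep only events inside the closed window.
        if start <= event_date <= end and item_code in columns:
            totals[patient_id][item_code] += value
    # Fixed column order; a subject with no record for an item gets 0.
    return [
        [patient_id] + [totals[patient_id].get(code, 0) for code in columns]
        for patient_id in sorted(totals)
    ]


wide = to_wide(LOG, ["drug_A", "drug_B"], date(2010, 4, 1), date(2010, 4, 30))
for row in wide:
    print(row)
```

In this sketch a subject with no qualifying event inside the window is dropped from the output entirely; whether such subjects should instead appear as all-zero rows is a design choice that depends on the study.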
Results of the processing speed benchmark
| 261,369 | 554.5976 | 565.765 | 545.513 |
| 569,738 | 986.0611 | 1,000.902 | 971.052 |
| 1,150,684 | 1,911.162 | 1,932.173 | 1,890.286 |
| 2,301,367 | 3,616.403 | 3,673.398 | 3,598.065 |
Figure 5. Processing speed benchmark. Dots indicate the average processing time for 20 trials. The line indicates the prediction equation fitted with a linear regression.
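A linear fit of the kind described in this caption can be sketched from the benchmark table itself. The example below takes the first column as the data size and the second column as the average processing time; the excerpt does not label the columns or state units, so that reading is an assumption, as is the use of ordinary least squares (the authors' fitting procedure is not given here).

```python
# Data size (first table column) and average processing time (second column),
# copied from the processing speed benchmark table. Column meanings are assumed.
sizes = [261_369, 569_738, 1_150_684, 2_301_367]
avg_times = [554.5976, 986.0611, 1911.162, 3616.403]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(sizes, avg_times)
r2 = r_squared(sizes, avg_times, slope, intercept)
print(f"time ~ {slope:.6f} * size + {intercept:.1f} (R^2 = {r2:.4f})")
```

With only four points the fit is illustrative; it shows the shape of the prediction equation rather than reproducing the published coefficients.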
Results of the scalability benchmark
| 2 | 6,892.868 | 6,986.503 | 6,844.374 |
| 4 | 3,616.403 | 3,673.398 | 3,598.065 |
| 8 | 2,063.208 | 2,087.145 | 2,037.378 |
| 12 | 1,301.092 | 1,326.391 | 1,280.319 |
| 16 | 1,022.917 | 1,133.464 | 985.958 |
| 24 | 677.832 | 690.765 | 670.458 |
| 48 | 379.049 | 401.013 | 370.314 |
Figure 6. Scaling benchmark. Dots indicate the average processing time for 20 trials. The line indicates the prediction equation fitted with a power regression.
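A power regression of the form t = a * n^b can be sketched from the scalability table by fitting a straight line to the logarithms of both columns. The example below assumes that the first column is the amount of compute resources (the excerpt does not label it) and that the second column is the average processing time; fitting in log-log space is one common way to obtain a power curve and is not confirmed as the authors' procedure.

```python
import math

# First column (assumed: amount of compute resources) and average processing
# time (second column), copied from the scalability benchmark table.
units = [2, 4, 8, 12, 16, 24, 48]
avg_times = [6892.868, 3616.403, 2063.208, 1301.092, 1022.917, 677.832, 379.049]


def fit_power(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    sxy = sum((lx - mean_lx) * (ly - mean_ly) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                          # exponent of the power law
    a = math.exp(mean_ly - b * mean_lx)    # scale factor
    return a, b


a, b = fit_power(units, avg_times)
print(f"time ~ {a:.0f} * n^{b:.3f}")
```

An exponent of exactly -1 would mean processing time is inversely proportional to the first column, so the distance of the fitted exponent from -1 gives a rough indication of how far the scaling departs from that ideal.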