Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, Neoklis Polyzotis.
Abstract
Data analysts often build visualizations as the first step in their analytical workflow. However, when working with high-dimensional datasets, identifying visualizations that show relevant or desired trends in data can be laborious. We propose SeeDB, a visualization recommendation engine to facilitate fast visual analysis: given a subset of data to be studied, SeeDB intelligently explores the space of visualizations, evaluates promising visualizations for trends, and recommends those it deems most "useful" or "interesting". The two major obstacles in recommending interesting visualizations are (a) scale: evaluating a large number of candidate visualizations while responding within interactive time scales, and (b) utility: identifying an appropriate metric for assessing interestingness of visualizations. For the former, SeeDB introduces pruning optimizations to quickly identify high-utility visualizations and sharing optimizations to maximize sharing of computation across visualizations. For the latter, as a first step, we adopt a deviation-based metric for visualization utility, while indicating how we may be able to generalize it to other factors influencing utility. We implement SeeDB as a middleware layer that can run on top of any DBMS. Our experiments show that our framework can identify interesting visualizations with high accuracy. Our optimizations lead to multiple orders of magnitude speedup on relational row and column stores and provide recommendations at interactive time scales. Finally, we demonstrate via a user study the effectiveness of our deviation-based utility metric and the value of recommendations in supporting visual analytics.
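The deviation-based utility idea from the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: each candidate visualization is taken to be a (dimension, measure) pair aggregated with SUM, the aggregates are normalized into distributions, and the deviation is measured with Euclidean distance. The function names `aggregate`, `deviation_utility`, and `recommend` are hypothetical, and the sketch scores every view naively, without the pruning or sharing optimizations.

```python
from collections import defaultdict
from math import sqrt


def aggregate(rows, dimension, measure):
    """SUM(measure) GROUP BY dimension, normalized so the values sum to 1.

    The SUM aggregate and the normalization step are assumptions of this sketch.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: value / grand_total for group, value in totals.items()}


def deviation_utility(target_rows, reference_rows, dimension, measure):
    """Score a view by how far the target distribution deviates from the reference.

    Euclidean distance is an assumed choice of distance function.
    """
    p = aggregate(target_rows, dimension, measure)
    q = aggregate(reference_rows, dimension, measure)
    groups = set(p) | set(q)
    return sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in groups))


def recommend(target_rows, reference_rows, dimensions, measures, k):
    """Return the k highest-utility (utility, dimension, measure) triples."""
    scored = [
        (deviation_utility(target_rows, reference_rows, d, m), d, m)
        for d in dimensions
        for m in measures
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Here a view whose target distribution matches the reference scores zero, while views in which the studied subset behaves differently from the reference rise to the top of the ranking.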
Year: 2015 PMID: 26779379 PMCID: PMC4714568
Source DB: PubMed Journal: Proceedings VLDB Endowment ISSN: 2150-8097
Figure 1: Motivating Example
Datasets used for testing
| Name | Description | Size | \|A\| | \|M\| | Views | Size (MB) |
|---|---|---|---|---|---|---|
| Synthetic Datasets | | | | | | |
| SYN | Randomly distributed, varying # distinct values | 1M | 50 | 20 | 1000 | 411 |
| SYN*-10 | Randomly distributed, 10 distinct values/dim | 1M | 20 | 1 | 20 | 21 |
| SYN*-100 | Randomly distributed, 100 distinct values/dim | 1M | 20 | 1 | 20 | 21 |
| Real Datasets | | | | | | |
| BANK | Customer Loan dataset | 40K | 11 | 7 | 77 | 6.7 |
| DIAB | Hospital data about diabetic patients | 100K | 11 | 8 | 88 | 23 |
| AIR | Airline delays dataset | 6M | 12 | 9 | 108 | 974 |
| AIR10 | Airline dataset scaled 10X | 60M | 12 | 9 | 108 | 9737 |
| Real Datasets - User Study | | | | | | |
| CENSUS | Census data | 21K | 10 | 4 | 40 | 2.7 |
| HOUSING | Housing prices | 0.5K | 4 | 10 | 40 | <1 |
| MOVIES | Movie sales | 1K | 8 | 8 | 64 | 1.2 |
Figure 2: SeeDB Frontend
Figure 3: SeeDB Architecture
Figure 4: Confidence Interval based Pruning
Figure 5: Performance gains from all optimizations
Figure 6: Baseline performance
Figure 7: Effect of Group-by and Parallelism
Figure 8: Effect of Groups and Dimensions
Figure 9: Effect of All Optimizations
Figure 10: Distribution of Utilities
Figure 11: Bank dataset result quality
Figure 12: Diabetes dataset result quality
Figure 13: Latency Across Datasets
Figure 14: Examples of ground truth for visualizations
Figure 15: Performance of Deviation metric for Census data
Aggregate Visualizations: Bookmarking Behavior Overview
| | total_viz | num_bookmarks | bookmark_rate |
|---|---|---|---|
| MANUAL | 6.3 ± 3.8 | 1.1 ± 1.45 | 0.14 ± 0.16 |
| SEEDB | 10.8 ± 4.41 | 3.5 ± 1.35 | 0.43 ± 0.23 |