| Literature DB >> 34122663 |
Clara Betancourt1, Björn Hagemeier1, Sabine Schröder1, Martin G Schultz1.
Abstract
We present context-aware benchmarking and performance engineering of a mature TByte-scale air quality database system which was created by the Tropospheric Ozone Assessment Report (TOAR) and contains one of the world's largest collections of near-surface air quality measurements. A special feature of our data service https://join.fz-juelich.de is on-demand processing of several air quality metrics directly from the TOAR database. As a service that is used by more than 350 users of the international air quality research community, our web service must be easily accessible and functionally flexible, while delivering good performance. The current on-demand calculations of air quality metrics outside the database together with the necessary transfer of large volume raw data are identified as the major performance bottleneck. In this study, we therefore explore and benchmark in-database approaches for the statistical processing, which results in performance enhancements of up to 32%.Entities:
Keywords: Air quality data; Performance tests; Scientific database; Scientific web service
Year: 2021 PMID: 34122663 PMCID: PMC8180437 DOI: 10.1007/s12145-021-00631-4
Source DB: PubMed Journal: Earth Sci Inform ISSN: 1865-0473 Impact factor: 2.705
Fig. 1Configurations of database-driven data services: a Direct access to the database via SQL commands. b The database is accessed via a web service API; both service components run on the same hardware. c Our configuration as described in TOAR database and data services: as b, but the two service components operate on different hardware. The gear wheels and stop watches denote performance-critical data processing and data transfer
Benchmarks summary
| Benchmark | Tasks | Test cases |
|---|---|---|
| Aggregation of ozone values for given dates: - Count - Maximum - Average - Standard deviation | 1) ’Python’ 2) ’SQL’ 3) PL/Pythonu | |
| Ozone metrics for given series identifiers and years: - drmdmax1h - AOT40 - dma8epa - W90 | 1) ’Python’ 2) ’SQL’ | |
| Processing of full ozone table: - Parallel scan - Parallel aggregate | 1) max. 1 worker 2) max. 2 workers 3) max. 4 workers 4) max. 8 workers | |
| Aggregation of values for given dates: - Maximum value on a given date | 1) ’o3_hourly’ 2) ’temp_hourly’ | |
| Test bandwidth and latency | 1) Ping round trip time test 2) Bandwidth test |
For elaboration, see text
Fig. 2Graphical summary of the benchmark results. We give the ideal configuration for each benchmark and compare the relative performance enhancement with the baseline case
Benchmark Data aggregation inside versus outside the database results
| Aggre - | ’Python’ | ’SQL’ | ’PL/Python’ | Difference |
|---|---|---|---|---|
| gate | ’Python’- | |||
| ’SQL’ | ||||
| ’count’ | 0.23 ± 0.05 | 0.22 ± 0.03 | 21.7 % | |
| ’max’ | 0.27 ± 0.05 | 0.26 ± 0.07 | 22.2 % | |
| ’avg’ | 0.25 ± 0.04 | 0.26 ± 0.03 | 32.0 % | |
| ’std’ | 0.23 ± 0.04 | 0.27 ± 0.06 | 30.4 % |
The cells contain execution time and standard deviation (n = 100 each). The fastest aggregation is marked in bold
Benchmark Metrics calculation inside versus outside the database results
| Metric | ’Python’ | ’SQL’ | Difference |
|---|---|---|---|
| ’drmdmax1h’ | 0.18 ± 0.01 | 16.7 % | |
| ’AOT40’ | 0.19 ± 0.02 | 21.1 % | |
| ’dma8epa’ | 0.18 ± 0.02 | 5.6 % | |
| ’W90’ | 0.18 ± 0.02 | 5.6 % |
The cells contain execution time and standard deviation (n = 250 each). The fastest metrics calculation is marked in bold
Benchmark Parallel processing results
| 1 worker | 2 workers | 4 workers | 8 workers | |
|---|---|---|---|---|
| Task | allowed | allowed | allowed | allowed |
| [#] / [s] / [%] | [#] / [s] / [%] | [#] / [s] / [%] | [#] / [s] / [%] | |
| 1 | 2 | 4 | 5 | |
| ’scan’ | 0.93 | 0.68 | 0.98 | |
| - 26.9 % | + 5.4 % | |||
| 1 | 2 | 4 | 8 | |
| ’agg’ | 99.03 | 78.71 | 76.10 | |
| - 20.5 % | - 32.2 % |
The cells contain the actual number of workers spawned, the execution time, and the difference to execution time with one worker. The best result for each task is marked in bold
Benchmark Influence of indices on query times results
| ’o3_hourly’ | ’temp_hourly’ | |
|---|---|---|
| Aggregate | (n = 100) | (n = 20) |
| ’max SQL’ | 82.63 ± 2.41 |
The fastest aggregation is marked in bold
Benchmark Transfer times between database server and web server results
| Transfer | Bandwidth | Latency |
|---|---|---|
| Between VMs | 0.7 ms | |
| in OpenStack | 8.3 GBit/s | ping round |
| cloud environment | trip time |
| Metric | Application | Description |
|---|---|---|
| Human health | 3-months running mean of daily maximum 1-hour mixing ratios are calculated. For annual statistics, the maximum value will be computed. | |
| Vegetation | Daily 12-h AOT40 values are accumulated using hourly values for the 12-h period from 08:00h until 19:59h solar time interval. AOT40 is defined as cumulative ozone above 40 ppb. If less than 75 % of hourly values (i.e. less than 9 out of 12 hours) are present, the cumulative AOT40 is considered missing. When there exist 75 % or greater data capture in the daily 12-h window, the scaling by fractional data capture (ntotal/nvalid) is utilized. For monthly, seasonal, summer, or annual statistics, the daily AOT40 values are accumulated over the aggregation period and scaled by (ntotal/nvalid) days. If less than 75 % of days are valid, the value is considered missing. | |
| Human health | Daily maximum 8-hour average statistics according to the US EPA definition. 8-hour averages are calculated for 24 bins starting at 00:00h local time. The 8-h running mean for a particular hour is calculated on the concentration for that hour plus the following 7 hours. If less than 75 % of data are present (i.e. less than 6 hours), the average is considered missing. A daily value is considered valid if at least 18 hourly averages are valid. For annual or seasonal statistics, the 4th highest daily 8-hour maximum of the aggregation period will be computed. | |
| Human health | EI = SUM(w |