Anand Kumar1, Vladimir Grupcev1, Meryem Berrada1, Joseph C Fogarty2, Yi-Cheng Tu1, Xingquan Zhu3, Sagar A Pandit2, Yuni Xia4.
Abstract
Molecular Simulation (MS) is a powerful tool for studying physical/chemical features of large systems and has seen applications in many scientific and engineering domains. During the simulation process, the experiments generate a very large number of atoms and intend to observe their spatial and temporal relationships for scientific analysis. The sheer data volumes and their intensive interactions impose significant challenges for data access, management, and analysis. To date, existing MS software systems fall short on storage and handling of MS data, mainly because they lack a platform to support applications that involve intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system our team has developed over the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) to take advantage of the declarative query interface (i.e., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is to handle the analytical queries, which are often compute-intensive. For that, we developed novel indexing and query processing strategies (including algorithms running on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload and analyze their data using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may be interesting to other users, so that the results are readily available without duplicating the analysis. We have developed a prototype of DCMS based on the PostgreSQL system, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We also used it as a platform to test other data management issues such as security and compression.
Keywords: Data compression; Molecular dynamics; Molecular simulation; Scientific database; Spatiotemporal database
Year: 2014 PMID: 26069879 PMCID: PMC4456345 DOI: 10.1186/s40537-014-0009-5
Source DB: PubMed Journal: J Big Data ISSN: 2196-1115
Figure 1. Snapshots of two MS systems: a collagen fiber structure with 890,000 atoms (top) and a dipalmitoylphosphatidylcholine (DPPC) bi-layer lipid system with 402,400 atoms (bottom).
Popular analytical queries in MS
| Query | Definition |
|---|---|
| Moment of inertia | |
| Moment of inertia on | |
| Sum of masses | |
| Center of mass | |
| Radius of gyration | |
| Dipole moment | |
| Dipole histogram | |
| Electron density | |
| Heat capacity | |
| Isothermal compressibility | |
| Mean square displacement | |
| Diffusion constant | |
| Velocity autocorrelation | |
| Force autocorrelation | |
| Density function | Histogram of atom counts |
| Spatial distance histogram (SDH) | Histogram of all atom-to-atom distances |
| Radial distribution function (RDF) | |
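Several of the queries listed above are plain aggregates over atom records, which is what makes a declarative SQL interface a natural fit. The sketch below is illustrative only: it uses Python's built-in `sqlite3` module as a stand-in for the PostgreSQL-based DCMS prototype, and the `atom_position` table, its columns, and the `frame_summary` helper are hypothetical names rather than the DCMS schema. The radius of gyration uses the common mass-weighted definition, which is an assumption here because the table's formulas were not preserved.

```python
import math
import sqlite3

# Hypothetical, simplified schema: one row per atom per frame.
# The actual DCMS schema (Figure 3) is more elaborate; this is only a stand-in.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE atom_position (
        frame_id INTEGER NOT NULL,
        atom_id  INTEGER NOT NULL,
        mass     REAL    NOT NULL,
        x REAL NOT NULL, y REAL NOT NULL, z REAL NOT NULL,
        PRIMARY KEY (frame_id, atom_id)
    )
    """
)
conn.executemany(
    "INSERT INTO atom_position VALUES (?, ?, ?, ?, ?, ?)",
    [
        (0, 1, 1.0, 0.0, 0.0, 0.0),
        (0, 2, 1.0, 2.0, 0.0, 0.0),
        (0, 3, 2.0, 1.0, 2.0, 0.0),
    ],
)


def frame_summary(conn, frame_id):
    """Return (sum of masses, center of mass, radius of gyration) for one frame."""
    total_mass, cx, cy, cz = conn.execute(
        """
        SELECT SUM(mass),
               SUM(mass * x) / SUM(mass),
               SUM(mass * y) / SUM(mass),
               SUM(mass * z) / SUM(mass)
        FROM atom_position
        WHERE frame_id = :frame_id
        """,
        {"frame_id": frame_id},
    ).fetchone()
    # Assumed definition: mass-weighted mean squared distance from the center of mass.
    (rg_squared,) = conn.execute(
        """
        SELECT SUM(mass * ((x - :cx) * (x - :cx)
                         + (y - :cy) * (y - :cy)
                         + (z - :cz) * (z - :cz))) / SUM(mass)
        FROM atom_position
        WHERE frame_id = :frame_id
        """,
        {"cx": cx, "cy": cy, "cz": cz, "frame_id": frame_id},
    ).fetchone()
    return total_mass, (cx, cy, cz), math.sqrt(rg_squared)


total_mass, center, radius_of_gyration = frame_summary(conn, 0)
```

Because the aggregation runs inside the database engine, only a handful of summary values leave the DBMS instead of every atom record.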
Figure 2. The DCMS architecture.
Figure 3. Schema of the DCMS database. The golden key symbol marks the primary keys and the lines represent foreign keys. Note that foreign keys exist from all Ids in the tables Connection, Torsion, and Angle to the Atom Static Info table; they are not drawn due to space limitations.
Figure 4. A 2D illustration of an irregular query region. Thin lines represent inclusive tree nodes visited at the 5th level in the Quadtree (the level 0 node in the tree covers the whole region).
Figure 5. Intuition behind computation of SDH by considering pairs of nodes in a density function. Curves represent distribution of distances between two nodes (blue) or within a single node (red).
Figure 6. Two spatial queries (Q1, Q2) and three recorded views (A, B, C) in a 2D space.
Figure 7. Steps in our MS data compression algorithm (borrowed from [30]).
Query processing time (in seconds) in database-centric and file-based MS analysis
| System | | | | | |
|---|---|---|---|---|---|
| DCMS + TPS | 0.0008 | 13.4 | 2.56 | 0.073* | 0.029 |
| DCMS | 0.069 | 6239 | 2.48 | 0.122* | 0.198 |
| GROMACS | 45.0 | 16410 | 52.5 | 8.49 | 16.8 |
*Time depends on the query range.
Figure 8. Performance of different SDH algorithms under different histogram resolutions.
Running time (seconds) of brute-force SDH method on GPU
| Number of atoms | | | | | |
|---|---|---|---|---|---|
| 100,000 | 1.86 | 11.15 | 424.7 | 228 | 38 |
| 300,000 | 15.24 | 96.6 | 3812.2 | 250 | 39 |
| 800,000 | 105.15 | 677.1 | 27142 | 258 | 40 |
| 8,000,000 | 3 hrs. | – | > 27 days | > 216 | – |
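For orientation, a brute-force SDH simply visits every unordered pair of atoms and increments the bucket its distance falls into. The following is a minimal CPU sketch in plain Python, not the GPU implementation timed above; the function name `brute_force_sdh`, the parameters `bucket_width` and `num_buckets`, and the choice to clamp out-of-range distances into the last bucket are assumptions made for illustration.

```python
import math
from itertools import combinations


def brute_force_sdh(points, bucket_width, num_buckets):
    """Count every unordered point pair into a distance bucket (quadratic time)."""
    histogram = [0] * num_buckets
    for p, q in combinations(points, 2):
        distance = math.dist(p, q)  # Euclidean distance, Python 3.8+
        # Assumption: distances past the last bucket edge are clamped into the last bucket.
        index = min(int(distance // bucket_width), num_buckets - 1)
        histogram[index] += 1
    return histogram


# Four toy "atoms" in 3D space yield 4 * 3 / 2 = 6 pairwise distances.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 5.0)]
histogram = brute_force_sdh(points, bucket_width=2.0, num_buckets=3)
```

The number of pairs is N(N-1)/2, so the work grows roughly quadratically with system size: tripling the atom count from 100,000 to 300,000 raises the first timing column from 1.86 s to 15.24 s, close to the ninefold increase in pairs, and a system of 8,000,000 atoms has about 3.2 × 10^13 pairs.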
Figure 9. Performance of our MS data compression method: (a) compression ratio and error; (b) effects on radial distribution function (RDF). More details can be found in [30].