Luyao Chen (1), Md Momin Aziz (2), Noman Mohammed (3), Xiaoqian Jiang (4). Affiliations: (1) Heinz College, Carnegie Mellon University, United States. (2) Computer Science, University of Manitoba, Canada. Electronic address: azizmma@cs.umanitoba.ca. (3) Computer Science, University of Manitoba, Canada. (4) School of Biomedical Informatics, University of Texas Health Science Center at Houston, United States.
Abstract
BACKGROUND AND OBJECTIVE: Cloud computing plays a vital role in big data science thanks to its scalable and cost-efficient architecture. Large-scale genome data storage and computation would benefit from the latest cloud computing infrastructures, which can save cost and speed up discoveries. However, due to privacy and security concerns, data owners are often disinclined to put sensitive data in a public cloud environment without enforcing protective measures. An ideal solution is a secure genome database that supports encrypted data deposition and query. METHODS: Making such a system fast and scalable enough to handle real-world demands while also providing data security is a challenging task. In this paper, we propose a novel mechanism to support secure count queries on an open-source graph database (Neo4j) and evaluate its performance on a real-world dataset of around 735,317 Single Nucleotide Polymorphisms (SNPs). In particular, we propose a new tree indexing method that offers constant time complexity (proportional to the tree depth), addressing a bottleneck of existing approaches. RESULTS: The proposed method significantly improves query execution time compared to existing techniques. It takes less than one minute to execute an arbitrary count query on a dataset of 212 GB, while the best-known algorithm takes around 7 minutes. CONCLUSIONS: The outlined framework and experimental results show the applicability of graph databases for securely storing large-scale genome data in an untrusted environment. Furthermore, the underlying cryptosystem and security assumptions are well suited to such use cases and can be generalized in future work.
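To make the idea of a count query whose cost is proportional to the tree depth more concrete, the following is a minimal plaintext sketch in Python. It is not the authors' implementation: it uses no encryption and no Neo4j, and the names `Node`, `build_index`, and `count_prefix`, as well as the toy genotype records, are hypothetical and introduced only for illustration. It assumes each record is an ordered sequence of genotypes and that a query fixes the genotypes at the first few SNP positions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Sequence


@dataclass
class Node:
    """One tree node; `count` is the number of records passing through it."""
    count: int = 0
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_index(records: Iterable[Sequence[str]]) -> Node:
    """Build a prefix tree over genotype sequences (plaintext, illustration only)."""
    root = Node()
    for record in records:
        node = root
        node.count += 1  # every record passes through the root
        for genotype in record:
            node = node.children.setdefault(genotype, Node())
            node.count += 1
    return root


def count_prefix(root: Node, query: Sequence[str]) -> int:
    """Count records whose leading genotypes match `query`.

    The walk visits one node per queried position, so its cost depends on
    the depth reached, not on the number of stored records.
    """
    node = root
    for genotype in query:
        child = node.children.get(genotype)
        if child is None:
            return 0  # no stored record has this prefix
        node = child
    return node.count


# Hypothetical toy records: genotypes at three ordered SNP positions.
records = [
    ("AG", "CC", "TT"),
    ("AG", "CC", "CT"),
    ("AA", "CC", "TT"),
    ("AG", "CT", "TT"),
]
index = build_index(records)
two_position_count = count_prefix(index, ("AG", "CC"))
```

In this sketch the per-node counters are what keep the query from scanning individual records. A secure variant would additionally need encrypted node contents and counts, which this example deliberately leaves out.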