
Secure searching of biomarkers through hybrid homomorphic encryption scheme.

Miran Kim¹, Yongsoo Song², Jung Hee Cheon².

Abstract

BACKGROUND: As genome sequencing technology develops rapidly, there is an increasing need to keep genomic data secure even when it is stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourcing of the matching problem on encrypted data.
METHOD: We propose an efficient method to securely search for the position matching the query data and to extract some information at that position. After decryption, only a small number of comparisons with the query information need to be performed in the plaintext state. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is that it encodes a genomic database as a single element of a polynomial ring.
RESULT: Since our method requires a single homomorphic multiplication of the hybrid scheme for query computation, it has an advantage over previous methods in parameter size, computational complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by the user but also reduces the communication cost by half. We evaluate the performance of our method and verify that computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search and extract the reference and alternate sequences at the queried position in a database of size 4M.
CONCLUSION: Our solution for finding a set of biomarkers in DNA sequences shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment.


Keywords: Biomarkers; Homomorphic encryption

Substances:
Biomarkers

Year: 2017 PMID: 28786366 PMCID: PMC5547457 DOI: 10.1186/s12920-017-0280-3

Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063

Background

The rapid development of genome sequencing technology gives us access to large genome datasets and looks poised to enable significant breakthroughs in medical research. While genomic data can be used for a wide range of applications including healthcare, biomedical research, and direct-to-consumer services, it has numerous distinguishing features, and its misuse can violate personal privacy through genetic disclosure or genetic discrimination [1-3]. Because of these potential privacy issues, it should be managed with care. Various privacy-enhancing techniques based on cryptographic methods have been proposed as tools for the outsourced analysis of genomic data. Recently, it has been suggested that privacy can be preserved through homomorphic encryption (HE), which allows computations to be carried out on ciphertexts. Yasuda et al. [4] gave a practical solution for finding the location of a pattern in a text by computing multiple Hamming distance values on encrypted data. Lauter et al. [5] gave a solution for privately computing the basic genomic algorithms used in genome-wide association studies. Homomorphic encryption can be applied to privacy-preserving sequence comparison, but it is still impractical for the analysis of entire human genomes. For example, Cheon et al. [6] presented a protocol to compute the edit distance on homomorphically encrypted data, but it took about 27 s even for DNA sequences of length 8. It is not easy to efficiently approximate the edit distance over encrypted data even when the distance to a public human DNA sequence is given [7]. This inefficiency comes from the difficulty of homomorphically evaluating an equality test: encrypting the inputs bit-wise and computing over the encrypted bits incurs an expensive computation cost (at least linear in the bit-length of the data). In this paper, we suggest an efficient method to securely search for a set of biomarkers using a hybrid Ring-GSW homomorphic encryption scheme.

Problem setting

The iDASH (Integrating Data for Analysis, ‘anonymization’ and SHaring) National Center organizes the iDASH Privacy & Security challenge for secure genome analysis. This paper is based on a submission to task 3 of the 2016 iDASH challenge: secure outsourcing of testing for genetic diseases on encrypted genomes. The goal of this task is to privately calculate the probability of genetic diseases by matching a set of biomarkers to encrypted genomes stored in a public cloud service. The requirement is that the entire matching process be carried out using homomorphic encryption, so that no information about the database or the query is revealed to the server during computation. Suppose that the client has a Variant Call Format (VCF) file which contains genotype information such as chromosome number and position in the genome. It also contains some information for each position such as the reference and alternate sequences, where each base is one of A, T, G, and C. The client encrypts the information using homomorphic encryption and the server calculates the exact match over the encrypted data. The outcome is the absence or presence of the specified biomarkers, that is, an encryption of 1 if matched and an encryption of 0 otherwise. Finally the client decrypts the result with the secret key of the homomorphic encryption scheme.

Practical homomorphic encryption

Fully Homomorphic cryptosystems allow us to homomorphically evaluate any arithmetic circuit without decryption. However, the noise of the resulting ciphertext grows during homomorphic evaluations, slightly with addition but substantially with multiplication. For efficiency reasons, for tasks which are known in advance, we use a more practical Somewhat Homomorphic Encryption (SHE) scheme, which evaluates functions up to a certain complexity. In particular, two techniques are used for noise management of SHE: one is the modulus-switching technique introduced by Brakerski, Gentry and Vaikuntanathan [8], which scales down a ciphertext during every multiplication operation and reduces the noise by its scaling factor. The other is a scale-invariant technique proposed by Brakerski such that the same modulus is used throughout the evaluation process [9]. Let us denote by $[\cdot]_Q$ the reduction modulo $Q$ into the interval $(-Q/2, Q/2] \cap \mathbb{Z}$ of the integer or integer polynomial (coefficient-wise). For a security parameter $\lambda$, we choose an integer $M = M(\lambda)$ that defines the $M$-th cyclotomic polynomial $\Phi_M(X)$. For a polynomial ring $R = \mathbb{Z}[X]/(\Phi_M(X))$, set the plaintext space to $R_t = R/tR$ for some fixed $t \ge 2$ and the ciphertext space to $R_Q = R/QR$ for an integer $Q = Q(\lambda)$. Let $\chi = \chi(\lambda)$ denote a noise distribution over the ring $R$. We use the standard notation $a \leftarrow \mathcal{D}$ to denote that $a$ is chosen from the distribution $\mathcal{D}$.

The basic scheme

The following is a description of the basic homomorphic encryption scheme based on the hardness of the (decisional) Ring Learning with Errors (RLWE) assumption, which was first introduced by Lyubashevsky et al. [10]. The assumption is that it is infeasible to distinguish the following two distributions. The first distribution consists of pairs $(a_i, u_i)$, where $a_i$ and $u_i$ are drawn uniformly at random from $R_Q$. The second distribution consists of pairs of the form $(a_i, b_i) = (a_i, a_i s + e_i)$, where $a_i$ is uniformly random in $R_Q$ and $s, e_i$ are drawn from the error distribution $\chi$. To improve efficiency for HE, we use sparse secret keys $s$ with coefficients sampled from $\{0, \pm 1\}$ as in [11].

RLWE.ParamsGen($\lambda$): Given the security parameter $\lambda$, choose an integer $M$, a modulus $Q$, a plaintext modulus $t$ with $t \mid Q$, and a discrete Gaussian distribution $\chi$. Output params $\leftarrow (M, Q, t, \chi)$.
RLWE.KeyGen(params): On the input parameters, let $N = \phi(M)$ and choose a sparse random $s$ from $\{0, \pm 1\}^N$. Generate an RLWE instance $(a, b) = (a, [-a s + e]_Q)$ for $e \leftarrow \chi$. We set the secret key sk $\leftarrow s$ and the public key pk $\leftarrow (a, b)$.
RLWE.Enc($m$, pk): To encrypt $m \in R_t$, choose a small polynomial $v$ and two Gaussian polynomials $e_0, e_1$ over $R_Q$ and output the ciphertext ct $\leftarrow (c_0, c_1) = ([v \cdot b + (Q/t) \cdot m + e_0]_Q, [v \cdot a + e_1]_Q) \in R_Q^2$.
RLWE.Dec(ct, sk): Given a ciphertext ct $= (c_0, c_1)$, output $m \leftarrow \lfloor (t/Q) \cdot [c_0 + s \cdot c_1]_Q \rceil$.
RLWE.Add(ct, ct′): Given two ciphertexts ct $= (c_0, c_1)$ and ct′ $= (c_0', c_1')$, the homomorphic addition is computed by ct$_{\mathrm{add}} \leftarrow ([c_0 + c_0']_Q, [c_1 + c_1']_Q)$.

Throughout this paper, we assume that the integer $M$ is a power of two so that $N = M/2$ and $\Phi_M(X) = X^N + 1$. We adapt the conversion and modulus-switching techniques of [12]. The conversion algorithm changes an RLWE encryption of $m = \sum_i m_i X^i \in R_t$ into an LWE encryption of its constant term $m_0$, and the modulus switching reduces the ciphertext modulus $Q$ down to $q$ while preserving the message. We note that an LWE ciphertext is represented as a vector in $\mathbb{Z}_q^{N+1}$ for some modulus $q$, and the decryption procedure is done by an inner product of the ciphertext and the secret key vector.

RLWE.Conv(ct): Given a ciphertext ct $= (c_0, c_1)$ with $c_0 = \sum_i c_{0,i} X^i$ and $c_1 = \sum_i c_{1,i} X^i$, output the vector ct′ $= (c_{0,0}, c_{1,0}, -c_{1,N-1}, \dots, -c_{1,1}) \in \mathbb{Z}_Q^{N+1}$.
LWE.ModSwitch(ct): Given a ciphertext ct $\in \mathbb{Z}_Q^{N+1}$, output the vector ct′ $\leftarrow \lfloor (q/Q) \cdot \mathrm{ct} \rceil \in \mathbb{Z}_q^{N+1}$.

An RLWE ciphertext ct $= (c_0, c_1)$ has the decryption structure of the form $c_0 + c_1 \cdot s = (Q/t) \cdot m + e$ and its constant term is $c_{0,0} + c_{1,0} s_0 - \sum_{i=1}^{N-1} c_{1,N-i} s_i = (Q/t) \cdot m_0 + \tilde{e}$, where $\tilde{e}$ is the constant term of $e$. It can be represented as an inner product of the vector $(c_{0,0}, c_{1,0}, -c_{1,N-1}, -c_{1,N-2}, \dots, -c_{1,1})$ and the desired LWE secret key $\vec{s} = (1, s_0, s_1, \dots, s_{N-1})$. Hence the output of the conversion algorithm can be seen as an LWE encryption of $m_0$. It is also easy to check that if ct $\in \mathbb{Z}_Q^{N+1}$ satisfies $\langle \mathrm{ct}, \vec{s} \rangle = (Q/t) \cdot m + e \pmod Q$, then the output ct′ of the LWE.ModSwitch algorithm satisfies $\langle \mathrm{ct}', \vec{s} \rangle = (q/t) \cdot m + e' \pmod q$ for some $e' \approx (q/Q) \cdot e$. These techniques have been proposed for an efficient bootstrapping [12], but they will play totally different roles in our application. Finally an LWE ciphertext of modulus $q$ can be decrypted by $\vec{s}$ as follows.

LWE.Dec(ct, sk): Given a ciphertext ct $\in \mathbb{Z}_q^{N+1}$, output the value $\lfloor (t/q) \cdot [\langle \mathrm{ct}, \vec{s} \rangle]_q \rceil \pmod t$.

If $\langle \mathrm{ct}, \vec{s} \rangle = (q/t) \cdot m + e \pmod q$ for some small enough $e$, it returns the correct message $m$ modulo $t$. More precisely, the decryption procedure works if $|te/q| < 1/2$.
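The conversion step can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it assumes a power-of-two cyclotomic ring with X^N = -1, uses random ring elements in place of a real ciphertext, and verifies only the algebraic identity that the constant term of c_0 + c_1·s equals the inner product of the converted vector with (1, s_0, ..., s_{N-1}). The function names are hypothetical.

```python
import random

def negacyclic_mul(a, b, Q):
    """Multiply two coefficient lists modulo (X^N + 1, Q)."""
    N = len(a)
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:
                # X^N = -1, so the wrapped term changes sign.
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out

def rlwe_conv(c0, c1):
    """Return (c_{0,0}, c_{1,0}, -c_{1,N-1}, ..., -c_{1,1})."""
    return [c0[0], c1[0]] + [-c for c in reversed(c1[1:])]

def conv_demo(N=8, Q=2**32, seed=0):
    """Compare the ring-side constant term with the LWE-side inner product."""
    rng = random.Random(seed)
    s = [rng.choice([-1, 0, 1]) for _ in range(N)]   # ternary secret (toy)
    c0 = [rng.randrange(Q) for _ in range(N)]        # stand-in ciphertext parts
    c1 = [rng.randrange(Q) for _ in range(N)]
    ring_const = (c0[0] + negacyclic_mul(c1, s, Q)[0]) % Q
    lwe_key = [1] + s                                # (1, s_0, ..., s_{N-1})
    inner = sum(x * y for x, y in zip(rlwe_conv(c0, c1), lwe_key)) % Q
    return ring_const, inner
```

Because the identity is purely algebraic, it holds for any c_0, c_1 and s; the noise and message only determine what the common value decodes to.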

The Ring-GSW scheme

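Gentry et al. [13] suggested a fully homomorphic encryption based on the LWE problem, where the message is encrypted as an approximate eigenvalue of a ciphertext. Ducas and Micciancio [12] described its RLWE variant. The RGSW symmetric encryption scheme consists of the following algorithms.

RGSW.ParamsGen(·), RGSW.KeyGen(·): Use the same parameter params and secret key $s$ with the basic RLWE scheme. Additionally set the decomposition base $B_g$ and exponent $d_g$ satisfying $B_g^{d_g} \ge Q$.
RGSW.Enc($m$, sk): To encrypt $m \in R$, pick a vector $\mathbf{a} \in R_Q^{2 d_g}$ uniformly at random, and $\mathbf{e} \in R^{2 d_g}$ with discrete Gaussian distribution $\chi$ of parameter $\varsigma$, and output the ciphertext $\mathrm{CT} \leftarrow [\mathbf{b}, \mathbf{a}] + m \cdot \mathbf{G} \in R_Q^{2 d_g \times 2}$, where $\mathbf{b} = -\mathbf{a} \cdot s + \mathbf{e}$ and the gadget matrix is $\mathbf{G} = (\mathbf{I}, B_g \mathbf{I}, \dots, B_g^{d_g - 1} \mathbf{I})^T$ for the $2 \times 2$ identity matrix $\mathbf{I}$.

Let $\mathrm{Decomp}_{B_g}(\cdot)$ be the decomposition with the base $B_g$, where the dimension of the input vector is multiplied by $d_g$ through this algorithm. The RGSW encryption of $m$ satisfies $\mathrm{CT} \cdot (1, s)^T = m \cdot \mathbf{G} \cdot (1, s)^T + \mathbf{e}$. Roughly, $m$ is an approximate eigenvalue of the ciphertext with respect to the secret-key vector $(1, s)^T$. In [14], the hybrid multiplication between RGSW ciphertexts and RLWE ciphertexts has been defined as follows.

Hybrid.Mult(CT, ct): Given an RGSW ciphertext $\mathrm{CT} \in R_Q^{2 d_g \times 2}$ and an RLWE ciphertext $\mathrm{ct} \in R_Q^2$, output the vector $\mathrm{ct}' \leftarrow \mathrm{Decomp}_{B_g}(\mathrm{ct}) \cdot \mathrm{CT} \in R_Q^2$.

If CT and ct are RGSW and RLWE encryptions of $m$ and $m'$, respectively, their multiplication ct′ is a valid RLWE encryption of $m m'$. For convenience, we will denote the Hybrid.Mult(CT, ct) algorithm by $\odot$, i.e., $\mathrm{ct}' = \mathrm{CT} \odot \mathrm{ct}$.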
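To make the role of the decomposition base and exponent concrete, here is a small Python sketch of a base-B_g digit decomposition of a single coefficient. It is an illustration under assumptions: it uses unsigned digits in [0, B_g), whereas real implementations may use balanced digits, and the function names are hypothetical. The defaults 128 and 5 are the values reported later in the implementation section.

```python
def gadget_decompose(x, base=128, digits=5):
    """Split 0 <= x < base**digits into `digits` base-`base` digits,
    least significant digit first."""
    if not 0 <= x < base ** digits:
        raise ValueError("x does not fit in the given number of digits")
    out = []
    for _ in range(digits):
        out.append(x % base)
        x //= base
    return out

def gadget_recompose(ds, base=128):
    """Inverse map: inner product with (1, base, base^2, ...)."""
    return sum(d * base ** j for j, d in enumerate(ds))
```

Every digit is smaller than the base, which is why multiplying decomposed ciphertext coefficients by the RGSW rows amplifies the noise only by a factor on the order of the base rather than of the full modulus.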

Methods

Privacy-preserving database searching and extraction

Let us consider a database of $n$ tuples. Each tuple is a pair $(d_i, \alpha_i)$ for $i = 1, \dots, n$, where $d_i$ denotes a data-tag in the domain $\mathbb{Z}_N$ and $\alpha_i$ represents the corresponding value attribute in a plaintext space $\mathbb{Z}_t$. Note that all the tags should be distinct from each other. For instance, in the case of a personal information database, $\alpha_i$ may be the age of the user whose identity number is $d_i$. Given a query tag $d$ from the tag domain and a query value $\alpha$ from the plaintext space, the matching problem is to determine the existence of an index $i$ such that $(d, \alpha) = (d_i, \alpha_i)$. Now consider the following simplified search query: select $\alpha_i$ if there exists an index $i$ such that $d_i = d$; otherwise zero (⊥). The purpose of this section is to store the database and carry out this search query on the public cloud. The server should learn nothing from the encrypted query, and no information other than the final result should be leaked to the user. Throughout this work, we use the semi-honest (honest but curious) adversary model, which is a standard assumption for the evaluation of homomorphic encryption.

Our main idea is the following encoding of the database, which is suitable for the efficient computation of the equality test and extraction: $DB(X) = \sum_{i=1}^{n} \alpha_i X^{d_i} \in R_t$. The user encrypts this polynomial with the RLWE public-key encryption scheme and stores the ciphertext $\mathrm{ct}_{DB}$ on the server. At the query phase, given a query tag $d$, the user encrypts the monomial $X^{-d}$ with the RGSW symmetric encryption scheme and sends the ciphertext $\mathrm{CT}_Q$ to the server. We assume that the RGSW encryption scheme has the same secret key sk as the RLWE encryption scheme. Given the two ciphertexts $\mathrm{CT}_Q \leftarrow$ RGSW.Enc($X^{-d}$) and $\mathrm{ct}_{DB} \leftarrow$ RLWE.Enc($DB(X)$), the server first performs their hybrid multiplication to obtain an RLWE ciphertext, denoted by $\mathrm{ct}_{\mathrm{mult}}$. It follows from the previous section that $\mathrm{ct}_{\mathrm{mult}}$ is a valid RLWE encryption of the polynomial $DB(X) \cdot X^{-d} = \sum_{i=1}^{n} \alpha_i X^{d_i - d}$. Since we use the cyclotomic polynomial $\Phi_M(X) = X^N + 1$ of power-of-two degree, the polynomial ring has the property $X^N = -1$. Thus, for any tag $d$, the constant term of the polynomial $DB(X) \cdot X^{-d}$ is $\alpha_i$ if there is some index $i$ satisfying $d = d_i$, and zero otherwise.

Now the server applies the RLWE.Conv algorithm to $\mathrm{ct}_{\mathrm{mult}}$ to compute an LWE encryption $\mathrm{ct}_{\mathrm{conv}}$ of this constant term. This conversion procedure not only prevents the leakage of information that has not been queried but also reduces the size of the output ciphertext by half. In addition, the (optional) modulus-switching procedure can be applied to obtain a ciphertext with a smaller modulus and thus reduce the communication cost. Finally the user decrypts this LWE ciphertext and gets the desired value $\alpha_i$ or zero (⊥). Algorithm 1 summarizes the procedure of secure search-and-extraction.

Our method can be modified to support a secure comparison of data values using a hash (one-way) function. If hashed values of the $\alpha_i$ are used as polynomial coefficients, our method returns a hashed value of $\alpha_i$ to the user instead of $\alpha_i$. The user can then check whether the resulting value and the hashed query value are the same without learning any information about the database.
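The search-and-extract idea can be simulated entirely on plaintexts. The sketch below is an illustration only, with no encryption involved: it assumes the database polynomial places each value α_i at the coefficient of X^{d_i} in Z_t[X]/(X^N + 1), multiplies by the query monomial X^{-d} = -X^{N-d}, and reads off the constant term. Function names and the toy parameters are hypothetical.

```python
def encode_db(pairs, N, t):
    """Coefficients of DB(X) = sum_i alpha_i * X^{d_i} in Z_t[X]/(X^N + 1)."""
    coeffs = [0] * N
    seen = set()
    for d, alpha in pairs:
        if not 0 <= d < N:
            raise ValueError("tag must lie in [0, N)")
        if d in seen:
            raise ValueError("tags must be distinct")
        seen.add(d)
        coeffs[d] = alpha % t
    return coeffs

def query_monomial(d, N, t):
    """Coefficients of X^{-d}; since X^N = -1, X^{-d} = -X^{N-d} for d > 0."""
    m = [0] * N
    if d == 0:
        m[0] = 1
    else:
        m[N - d] = (t - 1) % t   # -1 modulo t
    return m

def negacyclic_mul(a, b, mod):
    """Multiply two coefficient lists modulo (X^N + 1, mod)."""
    N = len(a)
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % mod
            else:
                out[k - N] = (out[k - N] - ai * bj) % mod
    return out

def search_and_extract(db, d, N, t):
    """Constant term of DB(X) * X^{-d}: alpha_i if d == d_i, else 0."""
    return negacyclic_mul(db, query_monomial(d, N, t), t)[0]
```

In the protocol itself the multiplication is the single hybrid RGSW-by-RLWE product and the constant term is isolated by the RLWE-to-LWE conversion, so the server never sees either operand in the clear.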

Comparison with related work

Equality tests have traditionally been considered difficult to perform under homomorphic encryption because of their large circuit depth [7, 15, 16]. These works evaluate the equality test on each encrypted tuple of the database, so at least Ω(n) homomorphic operations are required for searching a database of size n. In addition, the method of Boneh et al. [17] does not protect the database information from the users; that is, the whole database can be recovered from the resulting ciphertext of a query. In contrast, our method is very efficient in parameter size and complexity since it requires only a single hybrid multiplication. One limitation of this method is that the tags $d_i$ must be bounded by the ciphertext dimension N in order to construct the encoding polynomial DB(X). Since the dimension N has a significant influence on the performance of the HE scheme, too large a value of N makes the scheme impractical. In the next section, we describe how to overcome this problem in the application to genomic data.

Secure searching of biomarkers

We return to our main goal of task 3: secure outsourced matching of a set of biomarkers to encrypted genomes. We describe how to encode and encrypt the genotype information of a VCF file in order to apply the privacy-preserving database searching and extraction. A VCF file contains multiple genotype information lines, each consisting of a triple $(ch_i, pos_i, SNPs_i)$ of chromosome number, position, and a sequence of SNP alleles. A chromosome identifier $ch_i$ ranges over 1 to 22, X, and Y. A non-negative integer $pos_i$ represents the reference position, with the first base having position 1, and $SNPs_i$ is a reference or alternate sequence in $\{A, T, G, C\}^*$. A query from the user is also a triple of the same form, and we aim to decide the absence or presence of this biomarker in the database file. We represent the sex chromosomes X and Y as 0 and 23, respectively. Then we define an encoding function by $d_i = ch_i + 24 \cdot pos_i$.

In the following, we describe how to encode the SNPs. For convenience we set an upper bound on the length of the SNPs: let $n_{\mathrm{SNP}}$ be the maximal number of reference (or alternate) alleles to be compared between the query genome and the user genome in the target database. Each SNP is represented by two bits as $A \mapsto 00$, $T \mapsto 01$, $G \mapsto 10$, $C \mapsto 11$, and the two-bit codes are concatenated. Next we pad a 1 to the left of the bit string in order to mark the starting position of the SNPs. Finally it is zero-padded into a binary string of length $\ell_{\mathrm{SNP}} = 2 \cdot n_{\mathrm{SNP}} + 1$, and we convert it into an integer value, denoted by $\alpha_i$. If a single nucleotide variant at the given locus is not known, then it is encoded as the all-zero string. For example, ‘GC’ is encoded as the bit string 1|10|11, which is represented as the integer $11011_{(2)} = 27$. Now consider the case where we wish to encode the reference and alternate alleles together. Let $\alpha_{i,\mathrm{ref}}$ and $\alpha_{i,\mathrm{alt}}$ denote the integer encodings of the $n_{\mathrm{SNP}}$ reference alleles and the $n_{\mathrm{SNP}}$ alternate alleles, respectively. Then we define the encoding $\alpha_i$ as the concatenation of the two encodings, i.e., $\alpha_i = \alpha_{i,\mathrm{ref}} \,\|\, \alpha_{i,\mathrm{alt}} = 2^{\ell_{\mathrm{SNP}}} \cdot \alpha_{i,\mathrm{ref}} + \alpha_{i,\mathrm{alt}}$ as an integer. Table 1 shows the format of the database file and illustrates some examples of encoded genomic data.

Table 1

The format of genome data and its encoding with $n_{\mathrm{SNP}} = 10$

CHROM	POS	d	REF	ALT	α
1	161235340	3869648161	G	A	12582916
1	161235596	3869654305	C	T	14680069
1	161235657	3869655769	G	T	12582917
1	161235981	3869663545	G	A	12582916
1	161237503	3869700073	·	TTTTTGT	21849
1	161237891	3869709385	G	A	12582916
1	161238009	3869712217	G	·	12582912
1	161238488	3869723713	A	G	8388614
1	161238683	3869728393	G	A	12582916
1	161238856	3869732545	T	·	10485760
1	161239028	3869736673	AG	·	37748736
1	161239142	3869739409	A	G	8388614
1	161239346	3869744305	G	T	12582917
1	161239470	3869747281	C	T	14680069
1	161239788	3869754913	·	AA	16
1	161239978	3869759473	C	T	12582917
1	161240641	3869775385	TGAT	·	740294656

A database file is encoded as a set of pairs $(d_i, \alpha_i)$ for $i = 1, \dots, n$ such that $d_i = ch_i + 24 \cdot pos_i$ and $\alpha_i$ is the encoded integer of the $i$-th SNP allele string. The encodings $d_i$ and $\alpha_i$ are then regarded as the data-tag and the value attribute, respectively. The data user constructs a polynomial $DB(X) \in R_t$ such that $DB(X) = \sum_{i=1}^{n} \alpha_i X^{d_i}$. The user encrypts the polynomial with the RLWE public-key encryption scheme as described above. The query genes are also encoded as a pair of integers $(d, \alpha)$; however, only the information of $d$ is encrypted using the RGSW symmetric encryption scheme, that is, the user encrypts the monomial $X^{-d}$.
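The encoding can be reproduced from the values in Table 1. The Python sketch below is an illustration, not the published code: the tag formula d = ch + 24·pos and the two-bit map A=00, T=01, G=10, C=11 are inferred from the table entries and the ‘GC’ example, a missing allele (shown as "·" in the table, "." in VCF) is assumed to encode as 0, and all function names are hypothetical.

```python
BASE_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}  # inferred from Table 1
SEX_CHROM = {"X": 0, "Y": 23}
MISSING = ("", ".", "\u00b7")  # empty, VCF dot, or the table's middle dot

def encode_tag(chrom, pos):
    """Tag d = ch + 24 * pos, with X -> 0 and Y -> 23."""
    key = str(chrom)
    ch = SEX_CHROM[key] if key in SEX_CHROM else int(key)
    return ch + 24 * pos

def encode_alleles(seq, n_snp=10):
    """Leading 1 bit followed by two bits per base, read as an integer."""
    if seq in MISSING:
        return 0
    if len(seq) > n_snp:
        raise ValueError("allele string longer than n_snp")
    return int("1" + "".join(BASE_BITS[b] for b in seq), 2)

def encode_value(ref, alt, n_snp=10):
    """Concatenate the REF and ALT encodings, each 2*n_snp + 1 bits wide."""
    width = 2 * n_snp + 1
    return (encode_alleles(ref, n_snp) << width) | encode_alleles(alt, n_snp)
```

For instance, `encode_tag(1, 161235340)` and `encode_value("G", "A")` give the d and α entries of the first row of Table 1.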

Results and discussion

In this section, we explain how to set the parameters and describe the optimization techniques used in our implementation. We also present the results obtained with these techniques. The dataset was randomly selected from the Personal Genome Project. Our implementation is publicly available on GitHub [18].

How to set parameters

Since all the matching computation is performed on encrypted data in the cloud, security against a semi-honest adversary follows from the semantic security of the underlying HE scheme. The security of the homomorphic encryption scheme relies on the hardness of the RLWE assumption. From the security analysis of [11], we derive a lower bound on the ring dimension $N$ required for a $\lambda$-bit security level. Given the ciphertext modulus $Q$, the estimation of noise growth during evaluation [12] and the decryption condition yield an upper bound on the plaintext modulus $t$ that ensures correct decryption after computation. We set $t$ to the largest power of two below this upper bound. If the encodings of the allele strings are too large, we divide them into smaller integers so that each of them is smaller than $t$, and we repeat the algorithm to construct a polynomial for each of these integers.
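One way to divide a large allele encoding into pieces smaller than t is to take its base-t digits. The source does not say exactly how the split is done, so the following Python sketch is an assumption made for illustration, with hypothetical function names; each digit would then go into its own database polynomial.

```python
def split_value(alpha, t, n_chunks):
    """Split a non-negative integer into n_chunks base-t digits (low digit first)."""
    chunks = []
    for _ in range(n_chunks):
        chunks.append(alpha % t)
        alpha //= t
    if alpha:
        raise ValueError("alpha does not fit in n_chunks digits")
    return chunks

def join_chunks(chunks, t):
    """Recombine the digits after each one has been retrieved separately."""
    return sum(c * t ** j for j, c in enumerate(chunks))

def chunks_needed(total_bits, t_bits):
    """Number of t_bits-wide chunks covering total_bits (ceiling division)."""
    return -(-total_bits // t_bits)
```

With a 42-bit encoding (two 21-bit halves for an allele bound of 10) and an 11-bit plaintext modulus, `chunks_needed(42, 11)` gives four polynomials per database under this splitting.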

Optimization techniques

As mentioned before, the ring dimension $N$ needs to be larger than the encoded integers $d_i$. However, the encoded integers $d_i$ from VCF files are about 32 bits long, while a dimension $N$ with $11 \le \log_2 N \le 16$ is considered appropriate for implementations of HE schemes that achieve both security and efficiency. Hence a direct application of our method to the VCF file would yield an impractical result. For compression of the tag data and its re-randomization, we make use of a pseudo-random number generator $H(\cdot)$ which transforms a tag $d_i$ into a pair of two non-negative integers $d_i^*$ and $d_i^\dagger$ less than $N$. Our implementation adopts SHA-3 and extracts $\log_2 N = 11$ bits of the hashed value for each of $d_i^*$ and $d_i^\dagger$. We construct two polynomials $DB^*$ and $DB^\dagger$ by Algorithm 2. Note that for any $1 \le i \le n$ and $(d_i^*, d_i^\dagger) = H(d_i)$, the pair of constructed polynomials $DB^*$ and $DB^\dagger$ satisfies $DB^*[d_i^*] + DB^\dagger[d_i^\dagger] = \alpha_i \pmod t$, where $DB[j]$ denotes the coefficient of $X^j$, so that the sum of the two extracted coefficients recovers $\alpha_i$. The procedure of database encoding for secure search of biomarkers is described in Algorithm 2.

Let $\mathrm{ct}_{DB^*}$ and $\mathrm{ct}_{DB^\dagger}$ denote the ciphertexts of the polynomials $DB^*$ and $DB^\dagger$, respectively. Similarly, given the query encoding $d$, the user computes its randomized value $H(d) = (d^*, d^\dagger)$ and encrypts the two monomials $X^{-d^*}$ and $X^{-d^\dagger}$. We denote the ciphertexts by $\mathrm{CT}_{Q^*}$ and $\mathrm{CT}_{Q^\dagger}$. The server computes the hybrid multiplications to obtain the ciphertexts $\mathrm{CT}_{Q^*} \odot \mathrm{ct}_{DB^*}$ and $\mathrm{CT}_{Q^\dagger} \odot \mathrm{ct}_{DB^\dagger}$. Now let $\mathrm{ct}_{\mathrm{res}}$ denote the ciphertext computed by the homomorphic addition of these two products. Finally the server converts it into an LWE ciphertext and performs the modulus-switching procedure as described above. Algorithm 3 describes the procedure of secure search-and-extraction using our proposed optimization techniques.
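The tag compression step can be sketched with Python's standard `hashlib`. This is an illustration under assumptions rather than the published code: the source states only that SHA-3 is used and that 11 bits are extracted for each index, so the choice of SHA3-256, the decimal-string serialization of the tag, and which bits are taken are all hypothetical, as is the function name.

```python
import hashlib

def compress_tag(d, log_n=11):
    """Map an integer tag to a pair of indices below 2**log_n using SHA3-256.

    Assumptions: the tag is hashed as its decimal ASCII string, and the two
    indices are the lowest and next-lowest log_n bits of the digest.
    """
    digest = hashlib.sha3_256(str(d).encode("ascii")).digest()
    h = int.from_bytes(digest, "big")
    mask = (1 << log_n) - 1
    d_star = h & mask
    d_dagger = (h >> log_n) & mask
    return d_star, d_dagger
```

Because a 32-bit tag is squeezed into two 11-bit indices, distinct tags can collide in one or both positions; how the database construction deals with such collisions is determined by Algorithm 2 and is not modelled here.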

Implementation results

Using the variable type ‘int32_t’ together with the basic C++ standard libraries speeds up the implementation, so we set $Q = 2^{32}$ as the ciphertext modulus. We also set $t = 2^{11}$ as the modulus of the plaintext space to ensure the correctness of the output ciphertext. We take the following parameters for the gadget matrix $\mathbf{G}$: $B_g = 128$ and $d_g = 5$, so that they satisfy the condition $B_g^{d_g} \ge Q$. Each coefficient of the secret key sk is chosen at random from $\{0, \pm 1\}$, and we set the number of nonzero coefficients in the secret key to 64. As in [12], we use the Gaussian distribution of standard deviation $\sigma = 1.4$ to sample random error polynomials. For the efficiency of homomorphic multiplication, we also use an optimized library for complex FFT, namely the Fastest Fourier Transform in the West (FFTW) [19]. That is, we use a complex primitive $2N$-th root of unity rather than a primitive root in a prime field of order $Q$. We measure a running time of 0.804 s to set up the FFT environment at dimension $2N = 2^{12}$. The key generation of the two schemes takes about 0.247 ms in total. Table 2 presents the time complexity and storage for the evaluation of secure searching of biomarkers. All the experiments were performed on a single Intel Core i5 processor running at 2.9 GHz. The chosen parameters provide a security level of $\lambda = 128$ bits.
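A quick arithmetic check of the stated parameters is sketched below in Python. It assumes N = 2^11 (from 2N = 2^12), 4 bytes per coefficient (from the use of int32_t), and the ciphertext shapes described earlier: two ring elements for RLWE, a 2·d_g by 2 matrix of ring elements for RGSW, and N + 1 integers for LWE. The size accounting is an inference, not something the source spells out, and the function name is hypothetical.

```python
def parameter_summary(log_n=11, log_q=32, base=128, digits=5, bytes_per_coeff=4):
    """Check the gadget condition and estimate raw ciphertext sizes."""
    N = 1 << log_n
    Q = 1 << log_q
    rlwe_bytes = 2 * N * bytes_per_coeff                 # (c_0, c_1)
    rgsw_bytes = 2 * digits * 2 * N * bytes_per_coeff    # 2*d_g rows, 2 columns
    lwe_bytes = (N + 1) * bytes_per_coeff                # vector of length N + 1
    return {
        "gadget_ok": base ** digits >= Q,                # 128^5 = 2^35 >= 2^32
        "rlwe_kib": rlwe_bytes / 1024,
        "rgsw_kib": rgsw_bytes / 1024,
        "lwe_bytes": lwe_bytes,
    }
```

Under these assumptions one RGSW ciphertext occupies 160 KiB, which appears consistent with the query size listed in Table 2, and the converted LWE ciphertext is roughly half the size of an RLWE ciphertext, in line with the halving of the output size mentioned in the Methods section.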

Table 2

Implementation results of secure searching of biomarkers

DB size	n_SNP	Query-enc time	DB-enc time	Eval time	Dec time	Query storage	DB storage	Result storage
10 K	2	3.247 ms	3.563 ms	0.018 s	0.004 ms	160 KB	3 MB	0.75 MB
	5		7.212 ms	0.039 s	0.011 ms		6 MB	1.5 MB
	10		14.813 ms	0.079 s	0.027 ms		12 MB	3 MB
100 K	2		21.424 ms	0.111 s	0.034 ms		17 MB	4.25 MB
	5		42.415 ms	0.227 s	0.064 ms		34 MB	8.5 MB
	10		99.921 ms	0.454 s	0.139 ms		68 MB	17 MB
4 M	2		0.745 s	3.954 s	1.171 ms		593 MB	148 MB
	5		1.506 s	7.911 s	1.949 ms		1185 MB	296 MB
	10		3.001 s	15.442 s	3.795 ms		2370 MB	593 MB


Conclusions

In this work, we suggested an efficient method to securely search for a query tag and extract the corresponding value from a database using a hybrid GSW homomorphic encryption scheme. We solved the secure outsourced matching problem by using a polynomial encoding and extracting the desired value through the multiplication of an RGSW ciphertext and an ordinary RLWE ciphertext. We then applied this method to find a set of biomarkers in DNA sequences. Our solution shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment. A few interesting open problems remain. First, we only considered the semi-honest adversary model in this work. Other tools such as homomorphic authenticated schemes may lead to more efficient protocols in the malicious setting. Another issue is to support k multiple queries while keeping the computation and communication cost below k times that of the single-query case. We expect to obtain much faster performance by enabling a batching method.

Review 1. Routes for breaching and protecting genetic privacy.

Authors: Yaniv Erlich; Arvind Narayanan
Journal: Nat Rev Genet Date: 2014-05-08 Impact factor: 53.242

2. Privacy in the Genomic Era.

Authors: Muhammad Naveed; Erman Ayday; Ellen W Clayton; Jacques Fellay; Carl A Gunter; Jean-Pierre Hubaux; Bradley A Malin; Xiaofeng Wang
Journal: ACM Comput Surv Date: 2015-09 Impact factor: 10.282

3. Private genome analysis through homomorphic encryption.

Authors: Miran Kim; Kristin Lauter
Journal: BMC Med Inform Decis Mak Date: 2015-12-21 Impact factor: 2.796
