
Efficient privacy-preserving variable-length substring match for genome sequence.

Yoshiki Nakagawa¹, Satsuya Ohata², Kana Shimizu³,⁴.

Abstract

The development of privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches for a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index to a secret-sharing scheme. More precisely, we developed an algorithm that achieves a secure table lookup in which $V[V[\ldots V[p_0] \ldots]]$ is computed for a given depth of recursion, where $p_0$ is an initial position and $V$ is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that, after the query input, its time, communication, and round complexities do not depend on the table length N. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently of the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence of length 10 million as the database and a query of length 100, and found that the query response time of our protocol was at least three orders of magnitude faster than that of a non-indexed database search protocol under a realistic computation/network environment.


Keywords: FM-index; LCP array; Maximal exact match; Private genome sequence search; Secret sharing; Secure multiparty computation; Suffix array

Year: 2022; PMID: 35473587; PMCID: PMC9040336; DOI: 10.1186/s13015-022-00211-1

Source DB: PubMed; Journal: Algorithms Mol Biol; ISSN: 1748-7188; Impact factor: 1.721

Introduction

The dramatic reduction in the cost of genome sequencing has prompted increased interest in personal genome sequencing over the last 15 years. Extensive collections of personal genome sequences have been accumulated both in academic and industrial organizations, and there is now a global demand for sharing the data to accelerate scientific research [1, 2]. As discussed in previous studies, disclosing personal genome information has a high privacy risk [3], so it is crucial to ensure that individuals’ privacy is protected upon data sharing. At present, the most popular approach for this is to formulate and enforce a privacy policy, but it is a time-consuming process to reach an agreement, especially among stakeholders with different legal backgrounds, which slows down the pace of research. Therefore, there is a strong demand for privacy-preserving technologies that can potentially compensate for or even replace the traditional policy-based approach [4, 5].

One important application that needs a privacy-preserving technology is private genome sequence search, where different stakeholders respectively hold a query sequence and a database sequence and the goal is to let the query holder know the result while simultaneously keeping the query and the database private. Many studies have addressed the problem of how to compute exact or approximate edit distance or the longest common substring (LCS) through techniques based on homomorphic encryption [6-8] and secure multi-party computation (MPC) [9-15], or how to compute sequence similarity based on private set intersection [16]. While these studies can evaluate global sequence similarity for two sequences of similar length, other studies address the problem of finding a substring between a query and a long genome sequence or a set of long genome sequences, with the aim of evaluating local sequence similarity [17-23]. Shimizu et al. proposed an approach to combine an additive homomorphic encryption and index structures such as FM-index [24] and the positional Burrows-Wheeler transform [25] to find the longest prefix of a query that matches a database (LPM) and a set-maximal match for a collection of haplotypes [17]. Sudo et al. used a similar approach and improved the time and communication complexities for LPM on a protein sequence by using a wavelet matrix [19]. Ishimaki et al. improved the round complexity of a set-maximal match, though the search time was more than one order of magnitude slower than [17] due to the heavy computational cost caused by the fully homomorphic encryption [18]. Sotiraki et al. used the Goldreich-Micali-Wigderson protocol to build a suffix tree for a set-maximal match [20]. According to experiments by [21], the search time of [20] is one order of magnitude slower than [17, 21]. Mahdi et al. [21] used a garbled circuit to build a suffix tree for substring match and a set-maximal match under a different security assumption such that the tree-traversal pattern is leaked to the cloud server. Chen et al. [22] and Popic et al. [23] found fixed-length substring matches using a one-way hash function or homomorphic encryption on a public cloud under a security assumption such that the database is a public sequence and a query is leaked to a private cloud server.

In this study, we aim to improve privacy-preserving substring match under the security assumption such that both the query and the database sequence are strictly protected. We first propose a more efficient method for finding LPM, and then extend it to find the longest maximal exact match (LMEM), which is more practically important in bioinformatics. We designed the protocol for LMEM for ease of explanation, and the protocol can be applied to similar problems such as finding all maximal exact matches (MEMs) with a small modification. To our knowledge, this is the first study to address the problem of securely finding MEMs.

Our contribution

The time complexity of the previous studies [17, 19] includes a factor of $N$ (the database length), and thus they do not scale well to a large database. For a similar reason, using secure matching protocols (e.g., [26]) for the shares (or tags in searchable encryption) of all substrings in a query and database is even worse in terms of time complexity. To achieve a real-time search on an actual genome database, we propose novel secret-sharing-based protocols that do not include the factor of $N$ in the time, communication, and round complexities for the search time (i.e., the time after the input of a query until the end of the search). The basic idea of the protocols is to represent the database string by a compressed index [24, 27] and store the index as a lookup table. LPM and MEMs are found by at most $\ell$ and $2\ell$ table lookups respectively, where $\ell$ is the length of the query. More specifically, the table is referenced in a recursive manner; i.e., one needs to obtain $V[j]$, where $j = V[i]$, given $i$. To ensure security, we need to compute $V[j]$ without seeing any element of $V$. The key technical contribution of this study is an efficient protocol that achieves this type of recursive reference. We named the protocol secret-shared recursive oblivious transfer (ss-ROT). While the previous studies require $O(N)$ time complexity to ensure security, the time, communication, and round complexities of ss-ROT are all $O(\ell)$ for $\ell$ recursive table lookups, except for the preparation of the table and generation of shares before the query input. Since the entire protocols mainly consist of $O(\ell)$ table lookups for LPM, and $O(\ell)$ table lookups and inner product computations for LMEM, the search times for LPM and LMEM do not depend on the database size. In addition to the protocols based on ss-ROT, we developed a protocol to reduce data transfer size in the initial step by using an approach similar to that taken in ss-ROT. The protocol offers a reasonable trade-off between the amount of reduction in data transfer in the initial step and the increase in computational cost in the later step.

We implemented the proposed protocol and tested it on substrings of a human genome sequence of varying lengths, and confirmed that the actual CPU time and data transfer overhead were in good agreement with the theoretical complexities. We also found that the search time of our protocol was three orders of magnitude faster than that of the previous method [17, 19]. For further performance analysis, we designed and implemented baseline protocols using major techniques of secret-sharing-based protocols. The results showed that the search times of our protocols were at least two orders of magnitude faster than those of the baseline protocols.
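To make the notion of recursive reference concrete, the following minimal sketch computes the nested lookup in the clear. It only illustrates the functionality that a secure recursive lookup must reproduce; it provides no privacy and is not the ss-ROT protocol itself. The function name `recursive_lookup` and the toy table are hypothetical.

```python
def recursive_lookup(V, p0, depth):
    """Plaintext reference: apply V to the position p0 `depth` times.

    This is only the functionality V[V[...V[p0]...]]; a secure protocol
    would have to compute it without revealing V or any intermediate index.
    """
    p = p0
    for _ in range(depth):
        p = V[p]          # the output of one lookup is the next index
    return p


V = [2, 0, 3, 1]                       # toy table (assumed for illustration)
result = recursive_lookup(V, 0, 3)     # V[0] = 2, V[2] = 3, V[3] = 1
print(result)                          # 1
```

Each step depends on the result of the previous one, which is why a naive secure version would need a fresh oblivious access to the whole table per step.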

Preliminaries

Secure computation based on secret sharing

Here, we explain the 2-out-of-2 additive secret sharing ((2, 2)-SS) scheme and how to securely compute arithmetic/Boolean gates (Fig. 1).

Fig. 1

Arithmetic addition and multiplication over secret sharing

Secret sharing and secure computation

In t-out-of-n secret sharing (e.g., [28]), we split the secret value x into n pieces, and can reconstruct x by combining t or more pieces. We call each piece a “share”. The basic security notion for secret sharing is that we cannot obtain any information about x even if we gather $t - 1$ or fewer shares. In this paper, we consider the case with $t = n = 2$. A 2-out-of-2 secret sharing ((2, 2)-SS) scheme over $\mathbb{Z}_{2^n}$ consists of two algorithms: $\mathsf{Share}$ and $\mathsf{Reconst}$. $\mathsf{Share}$ takes $x \in \mathbb{Z}_{2^n}$ as input and outputs $([\![x]\!]_0, [\![x]\!]_1)$, where the bracket notation $[\![x]\!]_i$ denotes the arithmetic share of the i-th party (for $i \in \{0, 1\}$). We denote $[\![x]\!] = ([\![x]\!]_0, [\![x]\!]_1)$ as their shorthand. $\mathsf{Reconst}$ takes $[\![x]\!]_0$ and $[\![x]\!]_1$ as inputs and outputs x. For arithmetic sharing $[\![x]\!]$ and Boolean sharing $[\![x]\!]^B$, we consider a power-of-two integer n and $n = 1$, respectively.

Depending on the secret sharing scheme, we can compute arithmetic/Boolean gates over shares; that is, we can execute some kind of processing related to x without knowing x. This means it is possible to perform some computation without violating the privacy of the secret data, which is called secure (multi-party) computation. It is known that we can execute arbitrary computation by combining basic arithmetic/Boolean gates. In the following paragraphs, we show how to concretely compute these gates over shares.

Semi-honest secure two-party computation based on (2, 2)-additive SS

We use a standard (2, 2)-additive SS scheme, defined by

$\mathsf{Share}(x)$: randomly choose $r \in \mathbb{Z}_{2^n}$ and let $[\![x]\!]_0 = r$ and $[\![x]\!]_1 = x - r$.
$\mathsf{Reconst}([\![x]\!]_0, [\![x]\!]_1)$: output $[\![x]\!]_0 + [\![x]\!]_1$.

Note that one of the shares of x ($[\![x]\!]_0$ or $[\![x]\!]_1$) does not reveal any information about x. In Fig. 1, the secret value is split into two shares; these are valid (2, 2)-additive shares because their sum equals the secret. Even if we can see one share, we cannot determine the value of x since the split of x is chosen uniformly at random. This means that, in Fig. 1, computing nodes $P_0$ and $P_1$ cannot obtain any information about x as long as these two nodes do not collude.

On the other hand, we can compute arithmetic gates over shares. $\mathsf{ADD}([\![x]\!], [\![y]\!])$ can be done locally by just adding each party’s share of x and of y. In Fig. 1 (left), we show an example of secure addition: $P_0$ and $P_1$ obtain the shares 6 and 7 by adding their two shares. In this process, $P_0$ and $P_1$ cannot learn which values they are adding.

Multiplication is more complex than addition. There are various methods for multiplication over shares, most of which require communication between computing nodes. In this paper, we use the standard method for $\mathsf{MULT}([\![x]\!], [\![y]\!])$ based on Beaver triples (BT) [29]. Such a triple consists of $[\![a]\!]$, $[\![b]\!]$, and $[\![c]\!]$ such that $c = ab$. Hereafter, $a_i$, $b_i$, and $c_i$ denote $[\![a]\!]_i$, $[\![b]\!]_i$, and $[\![c]\!]_i$, respectively. We use these BTs as auxiliary inputs for computing $\mathsf{MULT}$. Note that we can compute them in advance (or in the offline phase) since they are independent of the inputs $[\![x]\!]$ and $[\![y]\!]$. We adopt a trusted initializer setting (e.g., [30, 31]); that is, BTs are generated by a party other than the two computing nodes and then distributed. In the online phase of $\mathsf{MULT}$, each i-th party $P_i$ ($i \in \{0, 1\}$) can compute the multiplication share $[\![z]\!]_i$ as follows: $P_i$ first computes $[\![x']\!]_i = [\![x]\!]_i - a_i$ and $[\![y']\!]_i = [\![y]\!]_i - b_i$, and sends them to $P_{1-i}$. $P_i$ reconstructs $x'$ and $y'$. $P_0$ computes $[\![z]\!]_0 = x'y' + x'b_0 + y'a_0 + c_0$, and $P_1$ computes $[\![z]\!]_1 = x'b_1 + y'a_1 + c_1$. Here, $[\![z]\!]_0$ and $[\![z]\!]_1$ calculated with the above procedures are valid shares of xy; that is, $\mathsf{Reconst}([\![z]\!]_0, [\![z]\!]_1) = xy$. We shorten the notations and write the $\mathsf{ADD}([\![x]\!], [\![y]\!])$ and $\mathsf{MULT}([\![x]\!], [\![y]\!])$ protocols simply as $[\![x]\!] + [\![y]\!]$ and $[\![x]\!] \cdot [\![y]\!]$, respectively. Note that, similarly to the $\mathsf{ADD}$ protocol, we can also locally compute multiplication by a constant c, denoted by $c \cdot [\![x]\!]$.

We can easily extend the above protocols to Boolean gates. By converting $+$ and $-$ into $\oplus$ in the arithmetic $\mathsf{ADD}$ and $\mathsf{MULT}$ protocols, we can obtain the $\mathsf{XOR}$ and $\mathsf{AND}$ protocols, respectively. We can construct $\mathsf{NOT}$ and $\mathsf{OR}$ protocols from the properties of these gates. When we compute $\mathsf{NOT}([\![x]\!]^B)$, $P_0$ and $P_1$ output $\lnot [\![x]\!]^B_0$ and $[\![x]\!]^B_1$, respectively. When we compute $\mathsf{OR}([\![x]\!]^B, [\![y]\!]^B)$, we compute $\lnot \mathsf{AND}(\lnot [\![x]\!]^B, \lnot [\![y]\!]^B)$. We shorten the notations and write $\mathsf{XOR}$, $\mathsf{AND}$, $\mathsf{NOT}$, and $\mathsf{OR}$ simply as $\oplus$, $\wedge$, $\lnot$, and $\vee$, respectively.

By combining the above gates, we can securely compute higher-level protocols. The functionality of the secure subprotocols [15] used in this paper is shown in Table 1. Due to space limits, we omit the details of their construction.

In this paper, we consider the standard simulation-based security notion in the presence of semi-honest adversaries (for 2PC), as in [32]. We show the definition in Appendix 2. Roughly speaking, this security notion guarantees the privacy of the secret under the condition that computing nodes do not deviate from the protocol; that is, although computing nodes may perform arbitrary local computation on the data they receive, they do not (maliciously) manipulate the data they transmit to other parties. The building blocks we adopt in this paper satisfy this security notion. Moreover, as described in [32], the composition theorem for the semi-honest model holds; that is, any protocol is privately computed as long as its subroutines are privately computed.
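The sharing, local addition, and Beaver-triple multiplication described above can be sketched as follows. This is a minimal single-process simulation under stated assumptions: the ring size `N_BITS = 16` and all function names are hypothetical, both parties and the trusted initializer live in one program, and the "sending" of masked values is modelled by reconstructing them directly.

```python
import secrets

N_BITS = 16                 # assumed ring size, for illustration only
MOD = 1 << N_BITS           # arithmetic shares live in Z_{2^n}


def share(x):
    """Split x into two additive shares (share of P0, share of P1)."""
    r = secrets.randbelow(MOD)
    return (r, (x - r) % MOD)


def reconstruct(shares):
    """Recombine both shares into the secret value."""
    return (shares[0] + shares[1]) % MOD


def add(x_sh, y_sh):
    """Addition: each party adds its own two shares, no communication."""
    return tuple((x_sh[i] + y_sh[i]) % MOD for i in range(2))


def beaver_triple():
    """Trusted initializer: shares of random a, b and of c = a * b."""
    a = secrets.randbelow(MOD)
    b = secrets.randbelow(MOD)
    return share(a), share(b), share((a * b) % MOD)


def mult(x_sh, y_sh, triple):
    """Multiplication using one Beaver triple and one exchange of messages."""
    a_sh, b_sh, c_sh = triple
    # Each party masks its shares with its triple shares and sends them out.
    x_masked = tuple((x_sh[i] - a_sh[i]) % MOD for i in range(2))
    y_masked = tuple((y_sh[i] - b_sh[i]) % MOD for i in range(2))
    # Both parties now know x' = x - a and y' = y - b, which hide x and y.
    xp = reconstruct(x_masked)
    yp = reconstruct(y_masked)
    z0 = (xp * yp + xp * b_sh[0] + yp * a_sh[0] + c_sh[0]) % MOD
    z1 = (xp * b_sh[1] + yp * a_sh[1] + c_sh[1]) % MOD
    return (z0, z1)


x_sh, y_sh = share(5), share(8)
total = reconstruct(add(x_sh, y_sh))                       # 13
product = reconstruct(mult(x_sh, y_sh, beaver_triple()))   # 40
print(total, product)
```

Expanding $z_0 + z_1 = x'y' + x'b + y'a + c$ with $x' = x - a$, $y' = y - b$, and $c = ab$ gives $xy$, which is why the two outputs are valid shares of the product.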

Table 1

Secure subprotocols used in this paper

Protocol	Input	Output
$\mathsf{Equality}$	$[\![x]\!]$, $[\![y]\!]$	$[\![z]\!]^B$ s.t. $z = 1$ if $x = y$, otherwise $z = 0$
$\mathsf{Comp}$	$[\![x]\!]$, $[\![y]\!]$	$[\![z]\!]^B$ s.t. $z = 1$ if $x < y$, otherwise $z = 0$
$\mathsf{CastUp}$	$[\![x]\!] \in \mathbb{Z}_{2^n}$, $n'$	$[\![x]\!] \in \mathbb{Z}_{2^{n'}}$ ($n < n'$)
$\mathsf{B2A}$	$[\![x]\!]^B$	$[\![x]\!]$
$\mathsf{Choose}$	$[\![x]\!]$, $[\![y]\!]$, $[\![e \in \{0,1\}]\!]$	$[\![z]\!]$ s.t. $z = x$ if $e = 1$, otherwise ($e = 0$) $z = y$
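As an illustration of how a Table 1 functionality can be built from the basic gates, the sketch below realizes the Choose functionality with a single secure multiplication, using the identity $z = y + e \cdot (x - y)$. The paper omits the constructions of its subprotocols, so this identity, the ring size, and all function names are assumptions of this sketch and not necessarily the construction of [15]. For brevity, the Beaver triple is generated inside `mult`, standing in for the trusted initializer.

```python
import secrets

MOD = 1 << 16               # assumed ring Z_{2^16}, for illustration only


def share(x):
    """Additive (2, 2)-sharing of x."""
    r = secrets.randbelow(MOD)
    return (r, (x - r) % MOD)


def reconstruct(shares):
    return (shares[0] + shares[1]) % MOD


def mult(x_sh, y_sh):
    """Beaver-triple multiplication; the triple is created inline here."""
    a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
    a_sh, b_sh, c_sh = share(a), share(b), share((a * b) % MOD)
    xp = reconstruct(tuple((x_sh[i] - a_sh[i]) % MOD for i in range(2)))
    yp = reconstruct(tuple((y_sh[i] - b_sh[i]) % MOD for i in range(2)))
    z0 = (xp * yp + xp * b_sh[0] + yp * a_sh[0] + c_sh[0]) % MOD
    z1 = (xp * b_sh[1] + yp * a_sh[1] + c_sh[1]) % MOD
    return (z0, z1)


def choose(x_sh, y_sh, e_sh):
    """Shares of x if e = 1, shares of y if e = 0: z = y + e * (x - y)."""
    diff = tuple((x_sh[i] - y_sh[i]) % MOD for i in range(2))   # local
    prod = mult(e_sh, diff)                                     # one MULT
    return tuple((y_sh[i] + prod[i]) % MOD for i in range(2))   # local


picked_x = reconstruct(choose(share(11), share(22), share(1)))  # 11
picked_y = reconstruct(choose(share(11), share(22), share(0)))  # 22
print(picked_x, picked_y)
```

Neither computing node learns the selection bit, because it only ever appears as a share or masked by a triple value.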

Index structure for string search

Notation and definition

$\Sigma$ denotes a set of ordered symbols. A string consists of symbols in $\Sigma$. We denote a lexicographical order of two strings $S$ and $S'$ by $S \le S'$ (e.g., A < C < G < T and AAA < AAC). We denote the i-th letter of a string S by S[i] and a substring starting from the i-th letter to the j-th letter by S[i, j]. The index starts with 0. The length of S is denoted by |S|. A reverse string of S (i.e., $S[|S|-1], \ldots, S[0]$) is denoted by $\hat{S}$. We consider a direction from the i-th position to the j-th position as rightward if $i < j$ and leftward otherwise. Given a query $w$ and a database $S$, we define the longest prefix that matches a database string (LPM) by $w[0, j]$ with the largest $j$, where $0 \le j < |w|$ and $w[0, j] = S[k, k + j]$ for some $k$, and the longest maximal exact match (LMEM) by $w[i, j]$ with the largest $j - i$, where $0 \le i \le j < |w|$ and $w[i, j] = S[k, k + j - i]$ for some $k$.

FM-Index and related data structures

FM-Index [24] and related data structures [27] are widely used for genome sequence search. Given a query string $w$ of length $\ell$ and a database string S of length N, [24] enables LPM to be found in $O(\ell)$ time regardless of N, and it also enables LMEM to be found in $O(\ell)$ time if auxiliary data structures are used [27].

Given all the suffixes of a string S: $S[0, N-1], S[1, N-1], \ldots, S[N-1, N-1]$, a suffix array is an array of positions $(p_0, \ldots, p_{N-1})$ such that $S[p_0, N-1] < S[p_1, N-1] < \cdots < S[p_{N-1}, N-1]$. We denote the suffix array of S by SA and denote its i-th element by SA[i]. A Burrows-Wheeler transform (BWT) is a permutation of the sequence S such that its i-th letter becomes $S[SA[i] - 1]$. We denote a BWT of S by L and denote its i-th letter by L[i]. Let us define a rank of S for a letter c at position t by $\mathsf{Rank}_c(S, t) = |\{ j \mid S[j] = c, 0 \le j < t \}|$, a count of occurrences of letters that are lexicographically smaller than c in S by $\mathsf{CF}_c(S) = |\{ j \mid S[j] < c, 0 \le j < N \}|$, and the operation $\mathsf{LF}_c(S, i) = \mathsf{CF}_c(S) + \mathsf{Rank}_c(L, i)$. The match between $w$ and S is reported as a left-closed and right-open interval on SA, and the lower and upper bounds of the interval are each computed by $\mathsf{LF}$. Given a letter c and an interval [f, g) that corresponds to suffixes that share the prefix x (i.e., [f, g) reports the locations of the substring x in S), we can find a new interval $[f', g')$ that corresponds to all suffixes that share the prefix cx (i.e., locations of the substring cx) by

$$f' = \mathsf{LF}_c(S, f), \qquad g' = \mathsf{LF}_c(S, g).$$

The leftward extension of the match is called a backward search, which is the main functionality of FM-Index. By starting the search with the initial interval [0, N) and conducting the backward searches for $w[\ell - 1], w[\ell - 2], \ldots$, the longest suffix match is detected when the interval becomes empty ($f = g$). $\mathsf{CF}$ and $\mathsf{Rank}$ are precomputed and stored in an efficient form that can be searched in constant time. Therefore, the longest suffix match can be computed in $O(\ell)$ time. LPM is found if the search is conducted on $\hat{S}$ and the match is extended by $w[0], w[1], \ldots$.

Searching LMEM by repeating LPM for $w[0, \ell - 1], w[1, \ell - 1], \ldots, w[\ell - 1, \ell - 1]$ takes $O(\ell^2)$ time. We can improve it to $O(\ell)$ time by using the longest common prefix (LCP) array and related data structures [27]. The LCP array, denoted by $\mathsf{LCP}$, is an array that stores the length of the longest common prefix of $S[SA[i-1], N-1]$ and $S[SA[i], N-1]$ in $\mathsf{LCP}[i]$ for $0 < i < N$. The lcp-interval [i, j) of lcp-value d is an interval such that it satisfies $\mathsf{LCP}[i] < d$, $\mathsf{LCP}[j] < d$, $\mathsf{LCP}[k] \ge d$ for all $i < k < j$, and $\mathsf{LCP}[k] = d$ for at least one $i < k < j$, and is denoted by $d\text{-}[i, j)$. $d\text{-}[i, j)$ corresponds to all the suffixes that share the prefix $S[SA[i], SA[i] + d - 1]$. The parent interval of $d\text{-}[i, j)$ is the lcp-interval $d'\text{-}[i', j')$ such that $d' < d$ and $i' \le i < j \le j'$, and there is no other lcp-interval $d''\text{-}[i'', j'')$ such that $d' < d'' < d$ and $i' \le i'' \le i < j \le j'' \le j'$. The parent of the lcp-interval [f, g) can be found by

$$\mathsf{parent}([f, g)) = \begin{cases} [\mathsf{PSV}[f], \mathsf{NSV}[f]) & \text{if } \mathsf{LCP}[f] \ge \mathsf{LCP}[g], \\ [\mathsf{PSV}[g], \mathsf{NSV}[g]) & \text{otherwise,} \end{cases}$$

where $\mathsf{PSV}[i] = \max\{ j \mid 0 \le j < i, \mathsf{LCP}[j] < \mathsf{LCP}[i] \}$ and $\mathsf{NSV}[i] = \min\{ j \mid i < j \le N, \mathsf{LCP}[j] < \mathsf{LCP}[i] \}$. By finding a parent interval using $\mathsf{PSV}$ and $\mathsf{NSV}$ whenever it fails to extend the match, we can avoid useless backward searches, and thus LMEM is found with at most $2\ell$ backward searches. $\mathsf{LCP}$, $\mathsf{PSV}$, and $\mathsf{NSV}$ are precomputed and stored in an efficient form that can be searched in constant time, so we can find LMEM in $O(\ell)$ time. See section 5.2 of [27] for more details of the data structures. Examples of the search by FM-Index, $\mathsf{LCP}$, $\mathsf{PSV}$, and $\mathsf{NSV}$ are provided in Appendix 1.

Notes on Table 2 (Summary of complexities for our protocols and related protocols): Btime and Bsize are the generation time and size of BTs. Dtime and Dsize are the generation time for the shares of the database and the size of the shares. Stime is the time for the Search phase. Comm. is the size of data exchanged between computing nodes. Round is the number of data exchanges.
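The backward search and its use for LPM can be sketched in plaintext as follows. This is a minimal, non-secure illustration under several assumptions: the suffix array is built naively by sorting, a sentinel `$` is appended to the text (a common convention that the description above does not mention), the rank table is stored uncompressed, and all function names are hypothetical.

```python
def build_fm_index(text):
    """Return (CF, Rank, n) for text + '$' (sentinel assumed in this sketch)."""
    s = text + "$"                        # '$' sorts before A, C, G, T
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])    # naive suffix array
    bwt = [s[i - 1] for i in sa]          # L[i] = S[SA[i] - 1], wraps at 0
    alphabet = sorted(set(s))
    # cf[c]: number of letters in s that are lexicographically smaller than c
    cf = {c: sum(ch < c for ch in s) for c in alphabet}
    # rank[c][t]: number of occurrences of c in L[0:t]
    rank = {c: [0] * (n + 1) for c in alphabet}
    for t, ch in enumerate(bwt):
        for c in alphabet:
            rank[c][t + 1] = rank[c][t] + (ch == c)
    return cf, rank, n


def backward_search(letters, cf, rank, n):
    """Extend the match leftward one letter at a time.

    Stops when the half-open interval [f, g) would become empty and
    returns the number of letters matched with the last non-empty interval.
    """
    f, g, matched = 0, n, 0
    for c in letters:
        if c not in cf:
            break
        nf = cf[c] + rank[c][f]           # LF applied to the lower bound
        ng = cf[c] + rank[c][g]           # LF applied to the upper bound
        if nf == ng:
            break
        f, g, matched = nf, ng, matched + 1
    return matched, (f, g)


def lpm(query, database):
    """Longest prefix of query occurring in database.

    The index is built on the reversed database and the query letters
    are fed in the order w[0], w[1], ...
    """
    cf, rank, n = build_fm_index(database[::-1])
    matched, _ = backward_search(query, cf, rank, n)
    return query[:matched]


print(lpm("TACGG", "GATTACA"))            # TAC
```

Feeding `reversed(query)` to `backward_search` with an index built on the database itself (not reversed) gives the longest suffix match instead.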

Proposed protocols

Problem setting and outline of our protocols

We assume that a query holder (User), a database holder (Server), and two computing nodes $P_0$ and $P_1$ participate in the protocol. User holds a query string $w$ of length $\ell$ and Server holds a database string $S$ of length $N$. After the protocol is run, only User knows LPM or LMEM between $w$ and $S$. $P_0$ and $P_1$ do not obtain any information about $w$ and $S$, except for $\ell$ and $N$. Our protocol consists of offline, DB preparation, and Search phases. In the offline phase, Server generates BTs (correlated randomness used for multiplication) and sends them to $P_0$ and $P_1$. In the DB preparation phase, Server creates a lookup table and distributes its shares to $P_0$ and $P_1$. In the Search phase, User generates shares of the query and sends them to $P_0$ and $P_1$, and $P_0$ and $P_1$ jointly compute the result without obtaining any information about the lookup table. Finally, User obtains the results. Figure 2 shows the schematic view of our goal and model. Note that the offline and DB preparation phases do not depend on a query string, so they can be computed in advance for multiple queries.

Fig. 2

Schematic view of our goal and model. (0) Server (DB holder) distributes Beaver triples. (A reliable third party can serve as the trusted initializer instead.) (1) Server distributes shares of the database. (2) User (query holder) distributes shares of the query. (3) The computing nodes jointly calculate shares of the result. (4) The results are sent to User. The offline phase is (0), DB preparation phase is (1), and Search phase consists of (2)–(4)

In section "Secret-shared recursive oblivious transfer", we propose the important building block ss-ROT that enables recursive reference to a lookup table. In section "Secure LPM", we describe how to design the lookup table based on FM-Index, and propose an efficient protocol for LPM by using the lookup table and ss-ROT. In section "Secure LMEM", we describe the additional table design for auxiliary data structures, and propose the complete protocol for LMEM. Table 2 summarizes the theoretical complexities of the three protocols. For comparison, the complexities of the baseline protocols and a previous method for LPM based on an additive homomorphic encryption [17, 19] are shown. As we mentioned in section "Introduction", the baseline protocols are designed using major techniques of secret-sharing-based protocols. The detailed algorithms are described in Appendix 3.

Table 2

Summary of complexities for our protocols and related protocols

Protocol	Btime	Bsize	Dtime	Dsize	Stime	Comm.	Round
ss-ROT (proposed)	0	0	$\ell N$	$\ell N$	$\ell$	$\ell$	$\ell$
Secure LPM (proposed)	$\ell$	$\ell$	$\ell N$	$\ell N$	$\ell$	$\ell$	$\ell$
[17, 19] (LPM by AHE)	−	−	−	−	$\ell N$	$\ell \sqrt{N}$	$\ell$
Baseline LPM	$\ell^2 N$	$\ell^2 N$	$N$	$N$	$\ell^2 N$	$\ell^2 N$	$\log \ell + \log N$
Secure LMEM (proposed)	$\ell^2$	$\ell^2$	$\ell N$	$\ell N$	$\ell^2$	$\ell^2$	$\ell$
Baseline LMEM	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell ^3 N$$\end{document}ℓ3N	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell ^3 N$$\end{document}ℓ3N	N	N	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell ^3 N$$\end{document}ℓ3N	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ell ^3 N$$\end{document}ℓ3N	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\log \ell +\log N$$\end{document}logℓ+logN

BTime and Bsize are generation time and size of BTs. Dtime and Dsize are generation time for the shares of the database and size of the shares. Stime is the time for Search phase. Comm. is the size of data exchanged between computing nodes. Round is the number of data exchanges

Secret-shared recursive oblivious transfer

We define a problem called a secret-shared recursive oblivious transfer (ss-ROT) as follows.

Definition 1

We assume that a database holder and two computing nodes participate in the protocol. The database holder holds a vector V of length N whose elements are themselves valid positions in V, so that V can be referenced recursively. Given an initial position and a depth of recursion, the secret-shared recursive oblivious transfer protocol outputs shares of the recursively referenced element of V (Eq. 3) without leaking V to the computing nodes. For simplicity, we use a shorthand for the recursion of Eq. 3 that records the number of nested references. In our protocol, all the random values are generated uniformly at random.

DB preparation phase. The database holder generates random values and computes the randomized vectors R defined in Eq. 4. Each vector has N elements. The database holder then computes shares of these values and sends them to the two computing nodes.

Search phase. The Search phase consists of two steps and is described in Lines 2–5 of Protocol 1. The input is the initial position and shares of R. The output is a pair of shares of the recursively referenced element of V. An example of a search is illustrated in Fig. 3.

Fig. 3

Example of a search. In Step 1 of the Search phase, the two computing nodes jointly reconstruct one randomized value for each level of recursion. Because every reconstructed value is randomized, no element of V is leaked. In Step 2, the two computing nodes output their respective shares of the result, from which the recursively referenced element of V is recovered.
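The two-step search can be simulated in a single process. The following is a minimal sketch, not the authors' implementation: the masking rule in `prepare` (shift each table by the previous random value and add a fresh one) is an assumption chosen so that the recursion works, since the exact form of Eq. 4 is not reproduced in this text, and all function names are hypothetical. The initial position is assumed to be known to both computing nodes.

```python
import secrets

def share(x, mod):
    """Split x into two additive shares modulo mod."""
    s0 = secrets.randbelow(mod)
    return s0, (x - s0) % mod

def prepare(V, depth):
    """Database holder: build one masked copy of V per level and share it."""
    N = len(V)
    # r[0] = 0 because the initial position is given in the clear (assumption).
    r = [0] + [secrets.randbelow(N) for _ in range(depth)]
    node0, node1 = [], []
    for j in range(1, depth + 1):
        # Assumed masking: undo the previous mask on the index,
        # then add a fresh mask to the stored value.
        R = [(V[(i - r[j - 1]) % N] + r[j]) % N for i in range(N)]
        pairs = [share(x, N) for x in R]
        node0.append([a for a, _ in pairs])
        node1.append([b for _, b in pairs])
    m0, m1 = share(r[depth], N)  # shares of the last mask, removed in Step 2
    return (node0, m0), (node1, m1)

def search(view0, view1, p0):
    """Both computing nodes, simulated in one process."""
    (t0, m0), (t1, m1) = view0, view1
    N = len(t0[0])
    pos = p0
    # Step 1: only masked positions are ever reconstructed.
    for j in range(len(t0) - 1):
        pos = (t0[j][pos] + t1[j][pos]) % N
    # Step 2: the final element stays shared; each node removes its mask share.
    return (t0[-1][pos] - m0) % N, (t1[-1][pos] - m1) % N

def recursive_lookup(V, p0, depth):
    """Plaintext reference: V[V[...V[p0]...]] with `depth` nested references."""
    p = p0
    for _ in range(depth):
        p = V[p]
    return p
```

Adding the two returned shares modulo N gives the same value as `recursive_lookup`. The work done in `search` depends only on the depth, not on N, which matches the property claimed for the Search phase.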

Security intuition

In the DB preparation phase of ss-ROT, the database holder does not disclose any private values, and the two computing nodes receive only shares. In the Search phase, all the messages exchanged between the computing nodes are shares except for the values reconstructed in Step 1. In the j-th step of the loop in Step 1, a single element of a randomized vector is reconstructed. Since the reconstructed value is randomized by a random value, no information is leaked. Note that for each vector, all the elements are randomized by the same value, but only one of them is reconstructed, and different random numbers are used for different vectors. In Step 2, the computing nodes output their shares of the result, and no information other than the result is leaked.

Security

Theorem 1

ss-ROT is correct and secure in the semi-honest model.

Proof

Correctness and security of ss-ROT protocol are proved as follows. Correctness. We assume the following equation.In Step1, for , the protocol computes by reconstructing . From the definition of in Eq. 4,For , the protocol computes by reconstructing . From the definition of in Eq. 4 and the assumption of Eq. 5,Eq. 5 holds for by Eq. 6. It also holds for under the assumption that Eq. 5 holds for . Therefore by induction, Eq. 5 holds for . In Step 2, and output . Since Eq. 5 holds for ,is transformed into by plugging in . Therefore the final output of ss-ROT becomes . The above argument completes the proof of correctness of Theorem 1. Security. Since the roles of and are symmetric, it is sufficient to consider the case when is corrupted. The input to is and , and output of is . The function achieved by Protocol 1 is deterministic and the protocol is correct. Therefore, to ensure the security of Protocol 1, we need to prove existence of a probabilistic polynomial-time simulator such thatwhere X is ’s view. X consists of:All the messages from and are uniformly at random in , as they are generated by . holds for , and holds. are uniformly at random in from the definition of Eq. 4. for and (a message from ) (j-th message from ) for (j-th value obtained by in Step1) for . Let us denote a random number u chosen from a set uniformly at random by . We construct as described in Protocol 2. The output of is , , and . In Line 6 and Line 9, are generated such that they are uniformly at random in . In Line 7, and are generated by such that they are shares of and uniformly at random in . In Line 10, and are generated by such that they are shares of and uniformly at random in for . In Line 12, and are generated by such that they are shares of and uniformly at random in . All the elements of except for and () are uniformly at random in by Line 3. Therefore, Eq. 8 holds. By the above discussion, we find our ss-ROT satisfies security in the semi-honest model.

Complexities

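In the DB preparation phase, the database holder generates shares of V of length N once for each level of recursion. Therefore, the time and communication complexities are proportional to N times the depth of recursion. For the Search phase, a value is reconstructed once per iteration of the loop in Step 1. Since the time, communication, and round complexities of each reconstruction are O(1), those of the Search phase are proportional to the depth of recursion and do not depend on N.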

Secure LPM

Construction of lookup table The goal is to find LPM securely. To apply FM-Index for a prefix search, the reverse string of (i.e., ) is used. The backward search of FM-Index is formulated by Eq. 1. If we precompute for and A,T,G,C, and store them in a lookup table that consists of four vectors: , , , and such that , Eq. 1 is replaced by the following table lookupI.e., starting with the initial interval , we can compute the match by recursively referring to the lookup table while . Protocol overview The key idea of Secure LPM is to refer to V by ss-ROT, i.e., and jointly refer to V times in a recursive manner. To achieve backward search, and need to select for each reference, where x is a query letter to be searched with. This is achieved by expressing the query letter by unary code (Eq. 11 ) and computing the inner product of Eq. 11 and . To find LPM, and need to check for each reference. We use the subprotocol to check it securely. Since V is randomized with different numbers for searching f and g, the difference of the random numbers is precomputed and removed securely upon the equality check. receives only the result of each equality check to know LPM. For example, LPM is the prefix of length when for the i-th reference. If for all references, LPM is the entire query. DB preparation phase creates a lookup table and generates the following vectors in a similar manner to ss-ROT. For simplicity, we denote the length of by . is used for computing the lower bound f of the interval [f, g). We also generate for the upper bound g. R consists of vectors, each of length . Since the longest match is found when , also generates a vector that is used for equality check of f and g. Then, sends shares of , , and to and . Search phase Protocol 3 describes the algorithm in detail. generates four vectors , , , , each of length , as follows.For each j, encodes (e.g., if ). The aim of the encode is to compute when . Figure 4 illustrates an example of the table lookup.

Fig. 4

Example of a secure table lookup when = GCT and = ACGT. Only the lookup for a lower bound is shown. For simplicity, and are denoted by and . () is computed by , and . V is referenced securely by using R. is computed by . is computed by . is computed by generates shares of , , , and distributes them to and . and compute and in Lines 5–8 without leaking and , where corresponds to the match of w[0, j] and . In Lines 10–13, the equality of and is examined for all rounds. Note that different values and are used for and in order to conceal and . Since , , , , it is sufficient to check if is equal to either one of and . In Lines 16–18, receives all the results of equality checks (i.e., ) from and , and knows LPM by reconstructing them. For example, if GCT and , knows that LPM is GC.
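The lookup-table construction and the letter selection by unary code can be illustrated without any secret sharing. The following plaintext sketch assumes the standard FM-index backward-search update (the number of smaller characters plus a rank over the BWT) as the content of each table entry, builds the index naively by sorting suffixes, and uses hypothetical function names. It shows only the table logic, not the secure protocol.

```python
ALPHABET = "ACGT"

def build_lookup_tables(database):
    """Precompute the backward-search step for every letter and position."""
    s = database[::-1] + "$"   # reversed text; '$' sorts before A, C, G, T
    n = len(s)
    suffix_array = sorted(range(n), key=lambda i: s[i:])
    bwt = [s[i - 1] for i in suffix_array]
    tables = {}
    for c in ALPHABET:
        smaller = sum(1 for ch in s if ch < c)  # characters smaller than c
        column, rank = [], 0
        for i in range(n + 1):
            column.append(smaller + rank)       # entry for position i
            if i < n and bwt[i] == c:
                rank += 1
        tables[c] = column
    return tables, n

def unary_code(letter):
    """One-hot encoding of a query letter over A, C, G, T."""
    return [1 if letter == c else 0 for c in ALPHABET]

def select(tables, q, position):
    """Inner product of the unary code with the four table entries."""
    return sum(bit * tables[c][position] for bit, c in zip(q, ALPHABET))

def longest_prefix_match(database, query):
    tables, n = build_lookup_tables(database)
    f, g = 0, n                # initial interval [f, g)
    length = 0
    for letter in query:
        q = unary_code(letter)
        f, g = select(tables, q, f), select(tables, q, g)
        if f == g:             # empty interval: the match cannot be extended
            break
        length += 1
    return query[:length]
```

For the database ACGT, `longest_prefix_match("ACGT", "CGA")` returns `"CG"`. Because the table is built on the reversed database, each backward-search step extends the query prefix by one letter on the right.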

Theorem 2

Protocol 3 is correct and secure in the semi-honest model. Correctness and security of Protocol 3 are proved as follows. Correctness. The lookup table V simply stores all possible outputs of . Therefore, backward search (Eq. 1) is equivalent to Eq. 9. For the case of querying w, becomes lower bound f (for ) or upper bound g (for ) of the interval that corresponds to the prefix match of length k. In Line 5 of Protocol 3, is computed. Since and (), it is equivalent to . Line 6 computes in the same manner. Each vector in Eq. 10 is generated in the same manner as in Eq. 4. Since Eq. 10 uses the common random values and for , , , , we can recursively reference (A, C, G, T), which is obvious from the correctness of ss-ROT. Therefore, the recursion by Line 5 and Line 7 can compute , and the recursion by Line 6 and Line 8 can also compute . The longest match is found when the interval width becomes 0. Since and are randomized, Line 11 computes to obtain the correct interval width. When the width is 0, d becomes either one of 0, and . Therefore, Line 12 computes the equality d and 0, and respectively. By reconstructing all the results in Lines 16–18, knows the round, in which the interval width becomes 0; i.e., he/she knows LPM. The above argument completes the proof of correctness of Theorem 2. Security We only show a sketch of the proof. For Lines 1–2 of Protocol 3, and do not disclose any private values, and and receive the shares. For Lines 3–14, it is guaranteed by the subprotocols , , and that all the messages exchanged between and are shares except for the output of in Lines 7–8. (see section "Secure computation based on secret sharing" for details of the subprotocols.) In Lines 7–8, reconstructed values are and . Since the values are and according to Eq. 10, it is obvious that V is randomized for all rounds , and no information is leaked. For Lines 14–17, only the output of at Line 11 is reconstructed. 
The reconstructed values are either 1 or 0 according to the result of each equality check, and no information other than the result is leaked. The query holder may reveal the database sequence by making many queries. Such a problem is called output privacy. Although output privacy is outside of the scope of this paper, we should mention here that the query holder needs to make an unrealistically large number of queries to obtain the database sequence by such a brute-force attack, considering that the database sequence is very long.

The DB preparation phase generates shares of the randomized vectors for both the lower bound f and the upper bound g, each of length proportional to N. Therefore, the time and communication complexities are $O(\ell N)$. For the Search phase, the table-lookup subprotocols are computed twice in Lines 4–9 for $\ell$ rounds, and the equality check is computed once in Lines 10–13 for $\ell$ rounds. Note that the equality checks are computed in parallel, so their number of rounds can be reduced to a constant. Since the time, communication, and round complexities of each of these subprotocols are O(1), those of the Search phase become $O(\ell)$.

Secure LMEM

Construction of lookup table As described in section "Index structure for string search", we can find a parent interval by a reference to , , and . Therefore, in addition to defined in section "Secure LPM", we prepare lookup tables that simply store all the outputs of them; i.e., , , and . DB preparation phase generates randomized vectors , and using the same algorithm in section "Secure LPM" for length . As shown in Eq. 2, is referred by the upper and lower bounds of [f, g). Therefore, generates following circular permutations of such that and , and and , are permutated by the same random values, respectively. I.e.,where x is either f or g. is referred by both f and g, and is plugged in to f. Therefore, generates and such that both of them are randomized by , and is permutated by and is permutated by as follows. Similarly, is referred by both f and g, and is plugged in to g. Therefore, generates and as follows. distributes shares of , , , , , , , , and to and . Search phase Protocol 4 describes the algorithm in detail. generates query vectors , , , by Eq. 11 and distributes shares of the vectors to and . In Line 6 of Protocol 4, is computed by the reference to R (i.e., a search based on a backward search) similarly to Lines 5–6 of Protocol 3. In Line 11, is computed by the reference to W (i.e., a search based on , and ). In Line 13, the interval is updated by either or based on the result of in Lines 7–9, where corresponds to the interval that corresponds to a substring match. In each round, we need to know a query letter to be searched with, so we need to maintain the right end position of the match in the query. The position moves toward the right while the match is extended, but remains the same when the interval is updated based on and . To memorize the position, we prepare shares of a unit bit vector u of length , in which the position t is memorized as and . In Lines 20–23, u remains the same if the interval is updated based on and , and otherwise. 
When the search is finished (e.g., the right end of a match exceeds the right end of the query) . Therefore in Lines 25–28, while the right end of a match does not exceed the right end of the query and after finishing the search. In Lines 29–31, the inner product of () and u becomes the encoding of w[t] that is used for the next round. We also maintain the left end position of the match. While the match is extended, the position remains the same and it moves toward the right when the interval is updated by . The new left end position can be computed by $p + m - c$, where p is the current position, m is the length of the current match, and c is the lcp-value of (i.e., the longest common prefix length of suffixes contained in ). The position is computed in Line 33. The match length is incremented by 1 for each extension while the right end of the match does not exceed the query length. When the interval is updated by , the match length is reduced to the lcp-value of , which is computed by . The match length is computed in Line 32. In Line 35, the longest match length and the corresponding left end position are updated. After all the positions in the query have been examined, LMEM and its left end position are sent to the query holder in Line 37.

Theorem 3

Protocol 4 is correct and secure in the semi-honest model. Correctness and security of Protocol 4 are proved as follows. Correctness. V, R, and q are generated by the same algorithm used in Protocol 3. Therefore, Line 6 is equivalent to a backward search, and e1 is the result of the equality check of 0 and the width of the obtained interval in Lines 7-8. The lookup tables , , and store all the outputs of , and , and , , and are generated based on , , and , respectively. Since and are circular permutations of by the same random values and that are used for generating and respectively, Line 8 can compute and e2 holds the result. By using and e2, either or is selected. and are permutated by and , but are randomized by the identical random value . Similarly, and are permutated by and , but are randomized by . Since and are generated in the same manner as and , it is obvious that the reference by them is correct. The reference by is transformed intoand the reference by is transformed intowhere is any one of , and , and is the corresponding lookup table; i.e., either one of , and . Note that could be a different table for each , but we abuse the same notation for simplicity of notation. Since and are described in the form of and based on Eq. 5, Eq. 12 and Eq. 13 are transformed into and , which also satisfy the recursion form of Eq. 5. Thus, the intervals and are correct intervals and Line 11 is equivalent to computing Eq. 2. Lines 16–23, u remains the same if and otherwise. Therefore Lines 29–31 can choose the letter to be searched with. The match length and the start position are obtained based on e1 in Lines 32–33, and the longest value and the corresponding position are selected in Lines 34–35. The shares of the length and start position of LMEM are sent to , and reconstructs them. Then, Protocol 4 outputs them. The above argument completes the proof of correctness of Theorem 3. Security. We only show a sketch of the proof. 
For Lines 1–2 of Protocol 4, and do not disclose any private values, and and receive the shares. For Lines 3–37, it is guaranteed by the subprotocols , , , and that all the messages exchanged between and are shares except for the output of in Line 14. (see section "Secure computation based on secret sharing" for details of the subprotocols.) In Line 14, the reconstructed values are and , according to Eq. 5, Eq. 12, and Eq. 13. Since and are randomized by and , respectively, for all rounds , no information is leaked. In Line 38, reconstructs only the search result (the length and start position of LMEM). The DB preparation phase generates shares of and (, ) and and ( and ); vectors of length . Therefore, the time and communication complexities are . For the Search phase, is computed times in parallel in Lines 17–18. (These are not dependent on each other.) In Line 30, is computed times in parallel, and Line 30 is computed in parallel four times in Lines 29–31. Lines 17–18 and Lines 29–31 are repeated for rounds. Other subprotocols are also computed for rounds. The time, communication, and round complexities are O(1) for , and independent computation of for times does not increase the round complexity. The time, communication and round complexities are O(1) for the other subprotocols used in Protocol 4. Therefore, the complexities of the Search phase are for time and communication, and for the number of rounds. The time complexity of the standard (i.e., non-privacy-preserving) LMEM is while that of Secure LMEM is . The increase in time complexity is caused by the computation for maintaining match position securely.

Reducing size of shares in DB preparation phase

The protocols based on ss-ROT are quite efficient in the Search phase; however, they require a large data transfer from the database holder to the computing nodes in the DB preparation phase when the number of queries and the length of the database are large. To mitigate the problem, we propose another protocol that can reduce the size of shares in the DB preparation phase. We use two parameters m and n () for computing shares. When outputs , we denote the share by . When outputs , we denote the share by . We denote . In our protocol, all the random values are uniformly generated from .

Basic idea. Consider a lookup table used by Protocols 3 and 4. We sample the table and store the sampled values in a vector z. For each position i, we compute a value x[i] relative to p, where p is the sampled position closest to i. Given a position k, the table value at k can be computed from z and x[k]. Any element in z is non-negative and fits in n bits, while any element in x is also non-negative and fits in m bits. Our idea is to use n bits for storing z[i] and m bits for storing x[i]. Note that we used n bits for storing each table element in Protocol 3 and Protocol 4. Because only the sampled positions need n bits, the lookup table becomes approximately n/m times smaller than the original if M is sufficiently large. We use a rotation technique to hide an intermediate position. Since for most cases, we design a rotated table that satisfies by subtracting an offset from .

DB preparation phase. The database holder computes the following vectors for where is a random value, and .
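The sampling idea can be made concrete with a small plaintext sketch. It rests on assumptions that this text does not confirm: the table is sampled every 2^m positions, x[i] is the difference between the table value at i and the value at the closest sampled position not exceeding i, and consecutive table values differ by at most 1 (as they do for a rank-based FM-index table). The function names and the parameter values in the usage note are hypothetical.

```python
def compress(V, m):
    """Split V into sampled values z and small differences x.

    Assumes a sampling interval of 2**m and a non-decreasing V whose
    consecutive elements differ by at most 1, so every x[i] < 2**m.
    """
    step = 1 << m
    z = V[::step]
    x = [V[i] - z[i // step] for i in range(len(V))]
    return z, x

def lookup(z, x, m, k):
    """Recover the original table value at position k."""
    return z[k >> m] + x[k]

def table_bits(N, n, m):
    """Bits needed for the plain table and for the sampled representation."""
    step = 1 << m
    sampled = -(-N // step)  # ceil(N / step) sampled positions
    return N * n, sampled * n + N * m
```

For example, with the hypothetical values N = 10^7, n = 32 and m = 8, `table_bits` gives 320,000,000 bits for the plain table and 81,250,016 bits for the sampled one, close to the n/m = 4 reduction stated above.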

Theorem 4

for .

Proof

Following equation is equivalent to Eq. 14. holds for from the definition of . If , . Therefore, holds for . If and , . Let us consider when and . We denote . Then, and . Since holds because , an offset for and that for are same and . Therefore, holds for . Let be an integer vector of length such thatNote that , and is obtained by adding an offset to . Since is non-negative and at most , generates shares . also generates , , and . Above shares are used for computing lower bound f of an interval. generates shares for upper bound g in a same manner. Then distributes all the shares to and . Search phase generates table for a query string w by Eq. 11. generates shares of q and distributes them to and . The entire protocol is described in Protocol 5.

Security

Theorem 5

Protocol 5 is correct and secure in semi-honest setting.

Proof

Correctness and security of Protocol 5 are proved as follows. Correctness. In Lines 5–6 of Protocol 5, is computed. In Line 8, is computed to avoid overflow in Line 9. In Line 9, shares of are computed, which is obvious from the definition of and . In Lines 11–13, , and are selected. From the definition of described in Eq. 14, it is obvious that is obtained by when and when , and Line 14 computes . g is computed similarly to f. Since the reference to achieved in Lines 4–16 is equivalent to evaluating Eq. 1 and an equality check of is conducted in Lines 17–19, Protocol 5 is correct. Security. We only show a sketch of the proof. All the messages exchanged between the two computing nodes are shares except for Line 6. In Line 6, the reconstructed value is randomized by in Line 5. Therefore, no information is leaked.

Complexities

In the DB preparation phase, shares of are generated with a parameter m and shares of other values including are generated with a parameter n. The length of is and that of is . The total number of other values does not depend on N. The query length is $\ell$, and shares of , , and other values are necessary for each query character. Therefore, time complexity is and communication complexity is . For the Search phase, , , , and are computed a few times for times in Lines 4–16 and is computed times in Lines 17–19. Since the time, communication, and round complexities of each of these subprotocols are O(1), those of the entire protocol become .

Experiment

We implemented Protocol 3 (Secure LPM), Protocol 4 (Secure LMEM), and Protocol 5. For comparison, we also implemented baseline protocols (Baseline LPM and Baseline LMEM). Details of the baseline protocols are provided in Appendix 3. All protocols were implemented in Python 3.5.2. The dataset was created from Chromosome 1 of the human genome. We extracted substrings of five lengths, up to $10^7$, for databases, and of five lengths, up to 100 (including 25, 50, 75, and 100), for queries. was run with and for and in the proposed protocols, and for a Boolean share and for an arithmetic share in the baseline protocols. We did not implement a data transfer module, and each protocol is implemented as a single program. Therefore, the search time of each protocol was measured as the time consumed by either one of the two computing nodes. To assess the influence of communication in a realistic environment, we theoretically estimated the delays caused by network bandwidth and latency. We assume three environments: LAN (0.2 ms/10 Gbps), WAN$_1$ (10 ms/100 Mbps), and WAN$_2$ (50 ms/10 Mbps). During the run of the Search phase, we stored all the data transferred from one computing node to the other in a file and measured the file size as the actual communication size. Note that the communication is symmetric, so the data transfer size is the same in both directions. Based on the data transfer size of D bytes, we estimate the communication delay by $D/k + eT$, where k is the bandwidth, e is the latency, and T is the number of communication rounds. All the protocols were run with a single thread on the same machine equipped with an Intel Xeon 2.2 GHz CPU and 256 GB of memory. We also tested the C++ implementation of [19], which is based on AHE. The algorithm for LPM in [17] for the string with (e.g., genome sequence) is the same as [19]. Sudo et al. [19] is implemented as server-client software, and the client and the server were run with individual single threads on the same machine. Therefore, the results of [19] do not include delays caused by bandwidth limitation and latency, so we also estimated these delays from the data transfer size and the number of communication rounds in the same manner. Each run of the program was terminated if the total runtime of all phases exceeded 20 h. In Table 3, the size unit is MB and the time unit is s, except for the cells marked “20 h<”.

Fig. 5

Estimated time (actual search time on a local machine + estimated data-transfer time) for various N

Fig. 6

Estimated time (actual search time on a local machine + estimated data-transfer time) for various query lengths

Comparison to baseline protocols

Table 3 shows the offline time and size, DB preparation time and size, and Search time and communication size for $N = 10^5$, $10^6$, and $10^7$. It also shows the result of Baseline LMEM for a separate query length, as its runs for the query length used for the other protocols did not finish within 20 h. The Search times and communication sizes of Secure LPM and Secure LMEM are several orders of magnitude smaller than those of Baseline LPM and Baseline LMEM. Since the round and communication complexities of the proposed protocols do not depend on N, their estimated Search time remains small even in WAN environments. Figure 5 shows the estimated Search time on WAN for various N. The times of Secure LPM and Secure LMEM do not increase, while those of the baseline protocols increase linearly with N. Figure 6 shows the estimated Search time on WAN for various query lengths $\ell$. We cannot show the results of Baseline LMEM because none of its runs finished within the time limit. As shown in the graph, the time of Secure LPM increases linearly with $\ell$ and that of Baseline LPM increases proportionally to $\ell^2$, which are in good agreement with the theoretical complexities in Table 2. According to the graph, the time of Secure LMEM also increases linearly with $\ell$, though its time and communication complexities are $O(\ell^2)$. This is because the CPU times are much smaller than the delays caused by network latency, which are governed by the round complexity $O(\ell)$.
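The estimated network times can be reproduced with a short calculation. The sketch below assumes that the delay is the transfer time plus one latency per round, that 1 MB is 10^6 bytes, and that bandwidth is given in bits per second. The round count of 202 used in the usage note is not stated in the text; it is a hypothetical value chosen because it reproduces the Secure LPM row for N = 10^5 in Table 3.

```python
def estimated_time(local_seconds, size_mb, rounds, latency_s, bandwidth_bps):
    """Local search time plus an estimated communication delay.

    Assumed model: delay = (bytes transferred * 8 / bandwidth) + latency * rounds.
    """
    transfer_seconds = size_mb * 1e6 * 8 / bandwidth_bps
    return local_seconds + transfer_seconds + latency_s * rounds

# Network settings from the text: (latency in seconds, bandwidth in bits per second).
ENVIRONMENTS = {
    "LAN": (0.0002, 10e9),
    "WAN1": (0.010, 100e6),
    "WAN2": (0.050, 10e6),
}
```

With a local Search time of 0.141 s, a communication size of 0.010 MB, and the assumed 202 rounds, this model gives about 0.181 s on LAN, 2.162 s on the first WAN setting, and 10.249 s on the second, which agrees with the corresponding row of Table 3. The latency term dominates in every setting, which is consistent with the observation that round complexity governs the estimated times.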

Table 3

Offline time (Time), offline size (Size), DB preparation time (Time), DB preparation size (Size), Search time on a local machine (Time), Search communication size (Size), estimated Search time for three environments: LAN (0.2 ms/10 Gbps), WAN (10 ms/100 Mbps), and WAN (50 ms/10 Mbps), for (only for Baseline LMEM), , and

	N	Offline		DB preparation		Search		Estimated time on network
	N	Time	Size	Time	Size	Time	Size	LAN	WAN$_1$	WAN$_2$
Secure	$10^5$	0.166	0.013	123	305	0.141	0.010	0.181	2.162	10.249
LPM	$10^6$	0.141	0.013	1248	3051	0.113	0.010	0.153	2.134	10.221
(proposed)	$10^7$	0.150	0.013	12628	30517	0.126	0.010	0.167	2.147	10.234
Secure	$10^5$	2.318	0.162	123	77	2.888	0.040	3.028	9.911	38.020
LPM2	$10^6$	2.317	0.162	1236	774	2.878	0.040	3.018	9.901	38.010
(proposed)	$10^7$	2.342	0.162	12387	7748	2.939	0.040	3.079	9.962	38.071
	$10^5$	–	–	–	–	691	163	691	707	838
[19]	$10^6$	–	–	–	–	7817	517	7818	7863	8261
	$10^7$	–	–	–	–	20 h<	–	–	–	–
Baseline (LPM)	$10^5$	3995	184	0.146	0.095	13	122	13	24	118
	$10^6$	38767	1841	1.522	0.954	164	1227	165	268	1196
	$10^7$	20 h<	–	–	–	–	–	–	–	–
Secure	$10^5$	7.619	1.704	435	1068	4.817	0.999	5.577	42.900	195.654
LMEM	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^6$$\end{document}106	7.882	1.704	4467	10681	4.926	0.999	5.686	43.009	195.763
(proposed)	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^7$$\end{document}107	8.457	1.704	46384	106811	5.740	0.999	6.501	43.824	196.578
Baseline	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^4$$\end{document}104	12747	611	0.015	0.010	46	407	46	80	389
(LMEM)	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10^5$$\end{document}105	20 h<	–	–	–	–	–	–	–	–

Sizes are in MB and times are in seconds, except for cells marked “20 h<”, which indicate a run of more than 20 hours.

Fig. 5

Estimated time (actual search time on a local machine + estimated data-transfer time) for various N

Fig. 6

Estimated time (actual search time on a local machine + estimated data-transfer time) for various

We have preliminary results for testing Secure LPM and Baseline LPM on an actual network (10 ms/100 Mbps). The results were 40 s for Secure LPM and 1739 s for Baseline LPM when . Although both preliminary implementations have room for improvement in data-transfer performance, these results also indicate that our protocol outperforms the baseline protocol and the previous study. The time and size of Secure LPM and Secure LMEM are several orders of magnitude better than those of the baseline protocols for the offline phase, and vice versa for the DB preparation phase. The total time of the offline and DB preparation phases of our protocols is more than one order of magnitude shorter than that of the baseline protocols. For the total size of the offline and DB preparation phases, Secure LMEM was better than Baseline LMEM, but Baseline LPM was better than Secure LPM even though the complexity is better for Secure LPM. This is because the majority of the shares were Boolean in the baseline protocols, while all of the shares were arithmetic in the proposed protocols.
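The LAN and WAN columns of Table 3 and the captions of Figs. 5 and 6 describe the estimated time as the actual search time on a local machine plus an estimated data-transfer time. The following is a minimal sketch of one way such an estimate could be formed, assuming a simple latency-plus-bandwidth model. The function name, the round count, and the link parameters are illustrative assumptions and are not taken from the paper; only the 0.126 s and 0.010 MB figures come from the Search-phase entries of the Secure LPM row for N = 10^7.

```python
def estimated_time(search_s, size_mb, rounds, latency_s, bandwidth_mbps):
    """Local search time plus a simple estimate of data-transfer time.

    Assumed model: each communication round costs one latency, and the
    payload (size_mb, in units of 10^6 bytes) is sent at bandwidth_mbps.
    """
    transfer_s = rounds * latency_s + (size_mb * 8.0) / bandwidth_mbps
    return search_s + transfer_s


# Search time (s) and communication size (MB) of Secure LPM for N = 10^7.
SEARCH_S = 0.126
SIZE_MB = 0.010

# Hypothetical round count and link settings, chosen only for illustration.
ROUNDS = 100
fast_link = estimated_time(SEARCH_S, SIZE_MB, ROUNDS, latency_s=0.0002, bandwidth_mbps=10000)
slow_link = estimated_time(SEARCH_S, SIZE_MB, ROUNDS, latency_s=0.010, bandwidth_mbps=100)

print(f"fast link: {fast_link:.3f} s, slow link: {slow_link:.3f} s")
```

Under this model, a protocol with a small communication size is dominated by the rounds-times-latency term, which is consistent with the observation that the estimated times of the proposed protocols barely change with N.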

Comparison to [19]

[19] is a two-party MPC protocol based on AHE. Each homomorphic operation is time-consuming, and the protocol has no offline or DB preparation phases. As shown in Table 3, the Search time of Secure LPM is four orders of magnitude shorter than that of [19] for . Since the time complexity of [19] includes a factor of N, the difference in Search time grows as N becomes large. Moreover, our protocols have a further advantage in communication for a query response when the network environment is poor, as the round complexities of [19] and our protocols are the same while [19] requires communication size. Even when all the phases are included, the entire runtime of our protocol is still six times shorter for and . We can compute LMEM by applying [19] to all the positions in a query string, but this approach consumed 3406 s and 2.6 GB of communication for .
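The size of the gap in Search time can be checked directly from Table 3. The short sketch below takes the Search-phase times of Secure LPM and [19] for the two database lengths at which [19] finished, and expresses their ratio as a power of ten. The dictionary layout and function name are illustrative choices.

```python
import math

# Search-phase times in seconds from Table 3 (Secure LPM vs. [19]).
search_time = {
    10**5: {"secure_lpm": 0.141, "ref19": 691.0},
    10**6: {"secure_lpm": 0.113, "ref19": 7817.0},
}


def orders_of_magnitude(slow, fast):
    """Base-10 logarithm of the ratio between two running times."""
    return math.log10(slow / fast)


gaps = {}
for n, t in search_time.items():
    gaps[n] = orders_of_magnitude(t["ref19"], t["secure_lpm"])
    print(f"N = {n}: [19] / Secure LPM is about 10^{gaps[n]:.1f}")
```

The ratio increases by roughly one order of magnitude when N grows tenfold, which reflects the factor of N in the time complexity of [19] and the near-constant Search time of Secure LPM.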

Result of the approach in section "Reducing size of shares in DB preparation phase"

We also implemented Protocol 5 (Secure LPM2) to investigate the trade-off between the reduction in the size of shares in the DB preparation phase and the increase in search time and communication overhead in the Search phase. We used the same programming language (i.e., Python 3.5.2) for the implementation and used the same datasets. was run with when generating the arithmetic shares of R. For the generation of the rest of the arithmetic shares, was run with and for and . (i.e., , (), and () for the notation used in section "Reducing size of shares in DB preparation phase"). The results are shown in Table 3. The total size of shares in the DB preparation phase was 7.7 GB for Protocol 5 and 30.5 GB for Protocol 3, which is in good agreement with the theoretical complexities discussed in section "Reducing size of shares in DB preparation phase". The search time of Protocol 5 is around 2 s longer than that of Protocol 3. We consider that the increase in search time is mainly caused by using rather costly subprotocols: , and more times, which also increases the number of communication rounds. Despite the increase in search time, Protocol 5 is still more than two orders of magnitude faster than Baseline LPM and three orders of magnitude faster than [19], so we consider that Protocol 5 offers a reasonable trade-off between performance in the DB preparation phase and the Search phase.

Discussion

As the results clearly show, the Search times of the proposed protocols are very short. Considering the importance of query response time for real applications, it is reasonable to reduce Search time at the cost of DB preparation time. Since the total times for the offline and DB preparation phases of the proposed protocols were significantly better than those of the well-designed baseline protocols, we consider the trade-off between Search and DB preparation times in our approach to be efficient. For further reduction of DB preparation time, parallelizing the share generation is a feasible approach. Regarding the DB preparation phase, the data transfer between the server and the computing nodes is problematic when the number of queries and the length of the database are large. To mitigate the problem, we also proposed the approach that uses arithmetic shares of a shorter bit length, which offers a reasonable trade-off between the reduction of data size in the DB preparation phase and the increase in time and communication overhead in the Search phase. Another solution that could potentially mitigate the problem is to use AES-based random number generation, similar to the technique used in [33]. Briefly, when the server needs to distribute a share of x, (1) the server and generate the same randomness r using a pre-shared key and a pseudorandom function, and (2) the server computes and sends it to . Although ’s computation cost increases, we can remove the data transfer from the server to . In our protocols, the generation of shares in the DB preparation phase cannot be outsourced because they depend on the database. Designing an efficient algorithm to outsource the share generation is an important open question.
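The idea of replacing one of the two share transfers with locally generated randomness can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes additive secret sharing over a ring of size 2^32, and it assumes that the value the server sends to the second node is x − r modulo the ring size, which the extracted text does not show. It also uses HMAC-SHA256 from the Python standard library as the pseudorandom function in place of AES. All function and variable names are hypothetical.

```python
import hashlib
import hmac
import secrets

MODULUS_BITS = 32            # assumed ring Z_{2^32}; the paper's share length may differ
MODULUS = 1 << MODULUS_BITS


def prf(key: bytes, index: int) -> int:
    """Pseudorandom ring element derived from a pre-shared key and an index."""
    digest = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % MODULUS


def server_distribute(key: bytes, values):
    """Server side: masked values, sent only to the second computing node."""
    return [(x - prf(key, i)) % MODULUS for i, x in enumerate(values)]


def node_derive(key: bytes, count: int):
    """First computing node: regenerate its shares locally from the key."""
    return [prf(key, i) for i in range(count)]


key = secrets.token_bytes(32)         # pre-shared between the server and the first node
table = [3, 1, 4, 1, 5, 9, 2, 6]      # toy stand-in for a database-derived vector

shares_second = server_distribute(key, table)   # the only data that is transferred
shares_first = node_derive(key, len(table))     # no transfer needed for these

reconstructed = [(a + b) % MODULUS for a, b in zip(shares_first, shares_second)]
print(reconstructed == table)
```

In this sketch the first node receives nothing from the server, so the data transferred in the DB preparation phase is roughly halved, at the cost of one pseudorandom function evaluation per share on that node.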
