Ardalan Naseri1, Erwin Holzhauser1, Degui Zhi2, Shaojie Zhang1. 1. Department of Computer Science, University of Central Florida, Orlando, FL, USA. 2. School of Biomedical Informatics and School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA.
Abstract
MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin's Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple 'good enough' matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin's Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce 'virtual insertion' of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
MOTIVATION: With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin's Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple 'good enough' matches are desired. RESULTS: In this work, we developed two algorithmic extensions of Durbin's Algorithm 5, that can find all L-long matches, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce 'virtual insertion' of the query into the PBWT matrix of the panel, and then scanning up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using the simulated data and the UK Biobank data. Our results show that our proposed algorithms can detect related individuals for a given query efficiently in very large cohorts which enables a fast on-line query search. AVAILABILITY AND IMPLEMENTATION: genome.ucf.edu/pbwt-query. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
The increasing volumes of whole-genome genotype data, partly due to the steady drop in the cost of genotyping (Campbell ; Jiang ), offer a new opportunity to study genetic relationships and relatedness between individuals. In the public domain, biobank-scale SNP array projects have collected large amounts of genotype data. For example, UK Biobank has released genotype and health-related data of ∼500 k individuals in the UK (Bycroft ; Sudlow ). However, this sample size is dwarfed by the collections in consumer genetics companies. As of July 2018, 23andMe claims to have over 5 million customers and have collected the genotype data of all customers. AncestryDNA has collected DNAs of more than 10 million individuals. It has been projected that genotype data of more than 100 million individuals will be available through direct-to-consumer companies by 2021 (Khan and Mittelman, 2018).Genetic relationships among individuals are reflected in Identity by Descent (IBD) segments, shared DNA segments between a pair of individuals that have been inherited from a common ancestor. The length of IBD segments correlates with how recently two individuals share a common ancestor. A pair of individuals that share a common ancestor in a more recent generation may share more IBD segments, of larger length, compared with a pair of individuals that share a common ancestor less recently (Thompson, 2013). Therefore, in genotype or haplotype sequences, IBD segments that define a recent common ancestor correspond to long matches of DNA sequences. It has been projected that about 60% of individuals of European descent have a third-degree cousin or closer relative in a current database of over 1 million (Erlich ) that can be found using IBD segments.Previous research efforts were mostly focused on ‘M-vs-M’ matches of IBD segments. For a collection (panel) with M individuals or haplotypes, the goal was to identify all IBD segments between any pairs of individuals in the panel. Several methods have been proposed to detect all pairwise IBD segments in a panel (Browning and Browning, 2011, 2013a, b; Gusev ; Purcell ; Rodriguez , 2015). Most of these works rely on pairwise comparisons and thus with computational complexity O(NM2), where N is the length of the genome. GERMLINE (Gusev ) is a fast ‘M-vs-M’ method claimed to be linear to the size of the panel [O(NM)] for random sequences. However, real genetic sequences typically contain repetitive local haplotypes and thus GERMLINE does not demonstrate linear behavior in real sequences. Durbin (2014) proposed an efficient genotype indexing method for storing and searching haplotype sequences. The proposed method, called Positional Burrows-Wheeler Transform (PBWT), is based on sorting the haplotype sequences based on their reversed prefix order. PBWT enables an efficient haplotype search among the haplotypes within a panel and has been applied for genotype phasing and imputation (Loh , b).For genealogical search, finding all matches greater than or equal to a certain length between a haplotype query and panel is desired. This can be thought of as finding all individuals in a panel related to the query that have a common ancestor in recent generations, which reveals more information about the query than set-maximal matches. A trivial example illustrates our point: in a large population containing a single pair of identical twins, the only set-maximal match that exists between any pair of individuals is the match between the twins spanning the length of the twins. All other relationships between any pair of individuals are missed. In the general case, to a lesser degree, examining only the longest set-maximal matches misses the majority of measurable genetic relationships between pairs of individuals in a population. The length of matches correlates with the number of generations that two individuals share a common ancestor. As a result, it can be applied to find multiple relatives of a given query individual up to a given degree of relatedness.The problem under consideration of this project is to develop an algorithm independent from the number of haplotypes for identifying all long IBD segments between a query and a panel, i.e. ‘1-vs-M’ search, given M haplotypes in the panel. Why is the time complexity independent from the number of haplotypes or individuals important? This is because when the panel contains millions of individuals, naive methods of all pairwise comparison [with computational complexity O(NM)] will be too slow for real-time applications. However, no efficient on-line method yet exists to identify all long IBD segments between a query and a panel. GERMLINE (Gusev ) does yet not offer a direct algorithm for ‘1-vs-M’ search. Durbin’s Algorithm 5 (Durbin, 2014) can find the single longest set-maximal matches between a query haplotype and all haplotypes in the panel with runtime O(N), i.e. independent from the number of haplotypes in the panel. However, in practice, multiple ‘good enough’ matches are desired for a genealogical search. Of note, one might be tempted to repeatedly apply Durbin’s set-maximal matches algorithm and exclude the detected match at each run. This solution would not be practical, since the indices for the panel need to be re-computed or updated after each detected match. To our knowledge, there are no algorithms to find all L-long matches, matches longer than or equal to a given length L between a query and a panel of haplotypes, independent from M.In this work, we present a set of efficient algorithms that can identify all L-long matches of a given query in an arbitrarily large panel. We address the main limitation of Durbin’s original PBWT algorithm that cannot find all L-long matches for a new query efficiently. First, we introduce an efficient approach for finding long matches to a query using PBWT, named PBWT-Query. The key idea of PBWT-Query is to ‘virtually’ insert the query haplotype into the PBWT matrix of the panel, and then scan up and down in the PBWT index for any long match. However, this algorithm may not be efficient when matches are numerous and the cost of scanning is non-negligible. To address this issue, we developed a second algorithm, L-PBWT-Query, that introduces additional pre-computed array data structures, named Linked Equal/Alternating Positions (LEAP) arrays, to help skip unnecessary repetitive up-and-down scanning of PBWT-Query at each site. Moreover, we developed a memory-mapped implementation of the L-PBWT-Query that can alleviate the main memory burden added by the LEAP arrays and the PBWT panel. The time complexity of both PBWT-Query and L-PBWT-Query is independent from the number of haplotypes in the panel.In the next section, we describe the algorithm in detail, followed by simulation results that show the efficiency of PBWT-Query and L-PBWT-Query. Finally, the application on real data is demonstrated by using the one million haplotypes from the UK Biobank data to search for related individuals extracted from the panel.
2 Materials and methods
2.1 Overview and notation
PBWT (Durbin, 2014) facilitates an efficient approach to find all L-long matches, matches longer than or equal to a given length L, among the haplotypes in a panel. It also provides a fast approach to find the longest matches for a haplotype query in a panel. However, finding L-long matches to a query is of greater interest. Suppose we have a haplotype . Then, s[i, j) denotes the subsequence , and denotes the length of s, i.e. . When we compare any haplotype sequences, we always assume that they share the same sites, number of sites, and ordering of sites. To follow the notations of Durbin (2014), we define the haplotype matrix of X as the matrix whose ith row is x.We can formulate our problem as follows: Given a query haplotype z and a database of haplotypes , where for any , we would like to find all L-long matches that are greater than or equal to length L between z and all . We say that there is an L-long match between z and x from sites e to k if , e = 0 or and k = N or , i.e. that match from e to k between z and x cannot be extended in either direction. To find L-long matches to a query, we first find the position of the query in the PBWT panel at each site. All of the possible L-long matches ending at a given site k will occur in contiguous blocks of haplotypes adjacent to the ‘would be’ position of the query in the panel sorted by the reverse of the prefix ending at site k. Hence, we can scan the neighboring sequences in the PBWT panel to find L-long matches. The time complexity of PBWT-Query is , where R is the average length of the matches, and c is the total number of matches.To speed up the search, we use LEAP arrays to reduce the search space of possible matches that terminate at each site k by skipping over haplotypes whose matches can still be extended to further sites. The LEAP arrays consist of arrays and for each site k, which we formally define in the Notation section. These arrays are generated from the PBWT matrix and independent from the query haplotype. We called this new approach L-PBWT-Query, whose worst-case time complexity is O(N + c), where c is the number of L-long matches. The LEAP arrays can be pre-computed in O(NM) time, and require O(NM) space in main memory or hard disk. Therefore, the greatest utility of our approach is the use-case where the user has a large amount of hard disk or main memory space and a constant (or infrequently changing) database haplotype panel and wants to perform many queries on the same panel. In this case, the database population needs only be pre-computed once to achieve subsequently fast. Following, we introduce PBWT data structures and LEAP arrays.
PBWT matrix and positional prefix array
We define the haplotypes in X sorted by the kth reversed prefix (i.e. by the reverse of sites 0 to k − 1) as y, and the positional prefix array a containing a permutation of the indices 0 to M − 1 such that . For every y, we refer to the contiguous block of haplotypes as through ; similarly, through refers to , etc. For example, if X = {1001, 1111, 0100}, then, y3 = {1001, 0100, 1111} and a3 = {0, 2, 1}. Then, we refer to the PBWT matrix as the matrix whose ith column corresponds to the ith column of y.
Divergence array
We define the divergence array d such that d[i] is the starting position of the longest match between and ending at k (Durbin, 2014), i.e. d gives the starts of all the longest matches ending at a given k for any two adjacent haplotypes in y. The longest match between some and (i < j) ending at k begins at .
LEAP arrays
Here, we formally define the LEAP arrays: Ips, Ipd, Ins, Ind, Dps, Dpd, Dns and Dnd. For each , i.e. each column of the PBWT matrix, we maintain four indices: and . Ips and Ins allow us to jump around between equal values within each column of the PBWT matrix. Similarly, Ipd and Ind allow us to jump around between differing values within each column of the PBWT matrix. The subscripts ps, pd, ns and nd are shorthand for previous-same, previous-different, next-same and next-different, respectively, which will make sense below. If j is the largest index such that and j < i, ; that is, we can use to jump in y from a given haplotype to the closest preceding haplotype that has the same value at k. If j is the largest index such that and j < i, ; that is, we can use to jump in y from a given haplotype to the closest preceding haplotype that has a different value at k. If j is the smallest index such that and j > i, ; that is, we can use to jump in y from a given haplotype to the closest proceeding haplotype that has the same value at k. Finally, if j is the smallest index such that and j > i, ; i.e. we can use to jump in y from a given haplotype to the closest proceeding haplotype that has the different value at k. Although these indices facilitate our ability to skip unnecessary haplotypes during our search, when we jump between haplotypes, we need to know if the block of potential L-long matches terminated in the skipped haplotypes. If any divergence value for the skipped haplotypes is larger than L − k, we know that our query cannot have an L-long match ending at k to the haplotype that we jumped to. Therefore, we maintain four additional arrays: Dps, Dpd, Dns and Dnd, which are natural counterparts to Ips, Ipd, Ins and Ind. Each of these stores the largest divergence value between the haplotypes that we skipped, depending on the index that we used to jump. For example, if we use to jump from to will give us ; similarly, if we use to jump from to will give us . Dps, Dpd, Dns and Dnd are undefined when fewer than two haplotypes are skipped using Ips, Ipd, Ins and Ind, respectively; in those cases, we can use the divergence arrays directly.
2.2 PBWT-Query: finding all L-long matches from a new sequence z to X in time
‘Virtual insertion’ of the query haplotype
Figure 1 shows a simple example of virtually inserting a query haplotype to the PBWT panel. The position of the new haplotype at each site will be adjacent to the longest match at the site k. Durbin (2014) proposed an efficient algorithm to find all set-maximal matches between a query z and a panel X, UpdateZMatches [Durbin’s Algorithm 5 (Durbin, 2014)]. A set-maximal match (Durbin, 2014), between z and some is defined as an L-long match () between z and x from e to k such that there does not exist a match between z and any other from to where or . Simply, the match between z and x is set-maximal if it cannot be extended in either direction, and z does not have a larger match to any other haplotype in X whose indices enclose the indices of its match to x. To find all set-maximal matches at k, UpdateZMatches computes three values: f, g and e such that z has a set-maximal match to through from e to k if through . It computes f, g and e based on and . A detailed routine for computing f, g and e can be found in the Supplementary Algorithm S1. To follow the notations of Durbin (2014), we define w such that and , where u[i] is the number of zero values at site k in through is the number of one values at site k in through , and c is the total number of zeros at site k.
Fig. 1.
An example of virtually inserting a new query haplotype to a PBWT panel at site k. The haplotype sequences of the panel are sorted based on the reversed prefix order in the PBWT panel at site k
An example of virtually inserting a new query haplotype to a PBWT panel at site k. The haplotype sequences of the panel are sorted based on the reversed prefix order in the PBWT panel at site kIt is the case that if z were in y, it would occur in sorted order immediately preceding either or . This is because matches through , but (or e = 0). In both cases, if , z can go in sorted order immediately preceding . Otherwise, it can immediately proceed to . We refer to this index of y which would immediately proceed to z as h, and we can use this to find all L-long matches between z and X ending at k.through all have a match to of at least length , and through all have a match to of at least length . This implies that if we pick some haplotype in X, we know that all matches to ending at k of length greater than or equal to L occur in some block through in y such that t < b, and . It follows that all matches to X ending at k, whether from an external query haplotype or internal haplotype, occur in a contiguous block at y.
Reporting all L-long matches
We know that all matches to z ending at k of length greater than or equal to L occur in a contiguous block through in y such that t < b, and . What is left, then, is to find the smallest index t and largest index b that enclose the block of haplotypes with potential L-long matches. We say potential because each is only a match if , but not an L-long match, as the end of the match can be extended, i.e. matches . We define to be the smallest value such that matches and to be the smallest value such that matches . We can use and d to scan up y beginning at until we find (except in the case where , printing L-long matches along the way. Similarly, we can use and d to scan down y beginning at until we find (except in the case where , printing L-long matches along the way. Specifically, we can keep scanning in either direction so long as for the particular we are iterating through [similar to Durbin’s Algorithm 3 (Durbin, 2014)], and we can print an L-long match between z and so long as . Although scanning up, if , the match between z and a particular would begin at . Similarly, while scanning down, if b > h, the match between z and a particular would begin at . A detailed procedure of this approach is presented in the Supplementary Algorithm S2.Figure 2 illustrates searching for L-long matches for L = 3. For clarity, we have included example ys for k = 1, 2, 3, where we have highlighted in gray the reverse prefixes that are used to sort the rows of the haplotype matrix, and have included the sorted order of z in these panels. The key idea of our approach is, scanning through each site k from 0 to N, to find the would-be position of the query z in y. Then, we can scan up and down site k in y to search for L-long matches that end at that site. First, at each site k, we find the block of potential set-maximal matches ending at k, labeled in purple. If z were in y, it would occur at the beginning or end of this block. In this example, z always happens to be at the beginning of the purple blocks, but it can also occur at the end. As we scan through this block, we output an L-long match ending at each green position if, at that position, there is a mismatch between that individual and the query.
Fig. 2.
An example of searching for L-long matches of length ≥3 in a panel with five haplotypes comprising eight sites. (A) The haplotype matrix (y0) and the query z. (B) The PBWT matrix. ith column in the PBWT matrix corresponds to the ith column of y. (C) The positional prefix arrays. Each row depicts the positional prefix array a. Each positional prefix array a contains the permutation of indices 0 to N − 1 such that . (D)ys for k = 1, 2, 3. The haplotypes in X sorted by the kth reversed prefix (i.e. by the reverse of sites 0 to k − 1) are referred as y. The prefixes used to sort are shaded in gray in y1, y2 and y3. The blocks of potential set-maximal matches ending at k are labeled in purple, the index of the haplotype that immediately proceeds to z is highlighted as red, and the intervals of potential L-long matches ending at each site are highlighted in lime green. f and g which enclose the block of potential set-maximal matches are underlined
An example of searching for L-long matches of length ≥3 in a panel with five haplotypes comprising eight sites. (A) The haplotype matrix (y0) and the query z. (B) The PBWT matrix. ith column in the PBWT matrix corresponds to the ith column of y. (C) The positional prefix arrays. Each row depicts the positional prefix array a. Each positional prefix array a contains the permutation of indices 0 to N − 1 such that . (D)ys for k = 1, 2, 3. The haplotypes in X sorted by the kth reversed prefix (i.e. by the reverse of sites 0 to k − 1) are referred as y. The prefixes used to sort are shaded in gray in y1, y2 and y3. The blocks of potential set-maximal matches ending at k are labeled in purple, the index of the haplotype that immediately proceeds to z is highlighted as red, and the intervals of potential L-long matches ending at each site are highlighted in lime green. f and g which enclose the block of potential set-maximal matches are underlinedThe time complexity of computing f, g and e is O(N) across all k (Durbin, 2014). For a given site, we may scan through all M haplotypes in search of t and b. If there is a block of matches that does not terminate at each site, we will still need to scan the entire block. As we do this for N sites, the time complexity of PBWT-Query is , where R is the average length of the matches and c the total number of matches. In the next section, we introduce L-PBWT-Query which employs additional data structures to achieve a time complexity independent from the number of haplotypes and the lengths of the matches.
2.3 L-PBWT-Query: finding all L-long matches from a new sequence z to X in time
Although searching for t and b, the smallest and largest indices that contain potential L-long matches, every scanned through where (i.e. the match can be extended further) is unnecessary work. Therefore, if we can restrict our search at each site only to those where , we can improve the time complexity of searching for t and b across all sites to , where c is the number of L-long matches. We can achieve this by maintaining a data structure that allows us to efficiently jump between of the same k, for i = 0 to M − 1 and k = 1 to N − 1. Again, the key idea is that we want to jump around between only those where . For bi-allelic data, if z[k] = 1, we primarily want to move around that equal to 0, and similarly, if z[k] = 0, we primarily want to move around between that equal to 1, to find matches that end at k + 1.
Speeding up PBWT-Query using LEAP arrays
Using our data structure, we can modify the procedures to scan up and down in search of t and b once we have found h. Namely, when we scan up, we can use or to find the first () such that , if such exists. Then, if necessary, we can use to keep searching only through haplotypes that differ from z at site k. Algorithm 1 demonstrates the updated routines which efficiently scan up a given y to find t and b using our data structure. Similarly, when we scan down, we can use or to find the first () such that , if such exists. Then, if necessary, we can use to keep searching only through haplotypes that differ from z at k. A detailed routine of the algorithm for scanning down is included in the Supplementary Algorithm S3. Then, Algorithm 2 demonstrates our updated L-PBWT-Query that makes use of our improved scanning routines in Algorithm 1 and Supplementary Algorithm S3.Figure 3 demonstrates y30 for an example genotype panel, along with the indices used to facilitate efficient query search. In the panel, t and b refer to the indices that define the block of haplotypes through where all L-long matches for L = 4 ending at site 30 may occur. Within that block of haplotypes, we are interested in those where . Since we would have to scan up from to find t and scan down from to find b, without the indices, we could end up scanning through the entire column, a worst-case O(M) operation; across all sites, we could end up performing O(NM) operations. However, with the indices, we only ever traverse through those sequences where the match ends at 30; so, at site 30, our scanning up and down is improved to worst-case O(number of matches ending at 30), and therefore, across all sites, our runtime is improved to worst-case , where c is the number of L-long matches.
Fig. 3.
An example of a haplotype panel with its corresponding indices to facilitate the search for L-long matches for L = 4 ending at site 30, exclusive. The indices allow jumping from any to the nearest preceding or proceeding such that or . and give us the largest divergence values from haplotypes skipped using and , respectively. The indices for the seventh sequence in y30 are highlighted in the table. t and b refer to the top and bottom of the block of potential L-long matches in y30, and h refers to the sequence that would immediately proceed to the query if it were in y30
An example of a haplotype panel with its corresponding indices to facilitate the search for L-long matches for L = 4 ending at site 30, exclusive. The indices allow jumping from any to the nearest preceding or proceeding such that or . and give us the largest divergence values from haplotypes skipped using and , respectively. The indices for the seventh sequence in y30 are highlighted in the table. t and b refer to the top and bottom of the block of potential L-long matches in y30, and h refers to the sequence that would immediately proceed to the query if it were in y30As before, the time complexity of computing f, g and e is O(N) across all k (Durbin, 2014). Now, our worst-case time complexity for scanning for t and b across all sites is O(N + c), where c is the number of L-long matches. So, our total worst-case time complexity for L-PBWT-Query is O(N + c), which is independent from the number of haplotypes M. Although our querying approach has a runtime linear in N and the number of L-long matches, our approach assumes that the PBWT, I matrices and D matrices for X are pre-computed, and stored in the hard disk. These can be computed in O(NM) time, and they occupy O(NM) space in the hard disk. Specifically, if X occupies NM memory, the additional I and D matrices occupy roughly 8NM memory.Scan Up For L-Long Matchesfunction ScanUp(k, h, d)if and match ends at k + 1 thenreport match to from d to k + 1ifthenif and then//Useorto find the first() such// that , if such existsif and thenelse ifthen//If necessary, we can useto keep searching only//through haplotypes that differ from z at site kwhile and doreport match to from dmax to k + 1ifthenelsewhiledoifthenreport match to from dmax to k + 1L-PBWT-Queryfor
to N − 1 do//z matchesthroughfromuntil//at least k + 1UpdateFandG(k)//z can be inserted intoin sorted order either//beforeor beforeifthenelseifthenScanUp(k, )ifh < MthenScanDown(k, )
Memory-efficient implementation of L-PBWT-Query
We implemented two versions of L-PBWT-Query: L-PBWT-Query (Memory-Mapped), where all of the pre-computed data structures are accessed using memory-mapped files using Boost libraries, and L-PBWT-Query (Memory-Extensive), where all of the pre-computed data structures are loaded into main memory. To clarify, L-PBWT-Query (Memory-Mapped) does not load all of the panel and data structures in main memory at once. Instead, parts of the panel and data structures are loaded from the appropriate files into memory in a ‘lazy loading’ fashion, i.e. as they are needed. Therefore, using memory-mapped files reduces I/O operations. This mechanism provides a relatively fast alternative to access the panel and data structures, especially if a panel has been recently queried by queries relatively similar to subsequent queries, which we refer to as ‘warming up’ a panel. When a panel is warmed up, relevant portions of the panel and data structures (a relatively small subset of the overall panel and data structures) have already been loaded into main memory and mapped to virtual memory, for faster subsequent access.
3 Results
3.1 Benchmarking using simulated data
We simulated a large haplotype panel using the Markovian Coalescent Simulator (MaCS) Chen with the command macs 500001 2000000 -t .001 -r .001 -h 1e2 and extracted subsets of the panel for our benchmarking. For the purpose of benchmarking, we implemented PBWT-Query (Memory-Extensive) and Exhaustive Search. Exhaustive Search scans across the entire length of sites for each pair of the query and a haplotype in the panel, which has O(NM) time complexity. For all of our benchmarks, we used the following protocol: When running L-PBWT-Query (Memory-Mapped), for a given panel, we always warmed up a panel with three runs on the particular query, then averaged the runtime of three additional runs. When running PBWT-Query (Memory-Extensive), L-PBWT-Query (Memory-Extensive) and Exhaustive Search, we averaged the runtimes of three runs.First, we want to verify that the runtime of PBWT-Query and L-PBWT-Query, in practice, is truly independent from the number of haplotypes. To that end, we developed a benchmark which we refer to as the increasing haplotypes benchmark. Figure 4 and Supplementary Table S1 show the results of our increasing haplotypes benchmark, respectively, which support our assertion that the runtime of our approach is independent from the number of haplotypes. For this benchmark, we tested panels where we successively increased the number of haplotypes in the panel by a constant amount (from 20 000 to 200 000 haplotypes in steps of 20 000). In the first experiment, we kept the number of sites (10 000) and the number of matches (25 359) constant for the added haplotypes (Fig. 4A and Supplementary Table S1). The minimum length of match L was set to 1000. We can see that, indeed, for the Exhaustive Search, run time increases roughly linearly with the number of haplotypes (M). The runtime of our memory-mapped implementation stays roughly constant, even as we increase the number of haplotypes ten times. When compared with L-PBWT-Query (Memory-Extensive), there appears to be some constant overhead associated with fetching data from memory-mapped files, instead of cache or main memory. PBWT-Query runtime is similar to L-PBWT-Query as the number of matches remains constant. In the second experiment, we increased the number of haplotypes and the number of matches increases linearly (Fig. 4B and Supplementary Table S1). Both L-PBWT-Query implementations show a better runtime compared with PBWT-Query with increasing number of haplotypes and increasing number of matches.
Fig. 4.
Running time (in seconds) of searching for a query in panels containing 10 000 sites with increasing number of haplotypes (A) while keeping the number of matches constant and (B) with increasing number of matches
Running time (in seconds) of searching for a query in panels containing 10 000 sites with increasing number of haplotypes (A) while keeping the number of matches constant and (B) with increasing number of matchesAdditionally, we want to concretely investigate the additive effect of increasing sites and increasing matches on the runtime of PBWT-Query and L-PBWT-Query. To that end, we developed a benchmark which we refer to as the increasing sites benchmark. Figure 5 is a plot of our increasing sites benchmarks, and Supplementary Table S2 gives the runtimes of those benchmarks. We performed two benchmarks: one where we successively increase the number of sites (from 20 000 to 200 000) while keeping the number of matches constant (Fig. 5A), and another where we successively increase the number of sites (from 20 000 to 200 000) while the number of matches increases linearly with the number of sites (Fig. 5B). From both of these figures, we can see that the runtime of our approach increases as expected, linearly with the number of sites and matches. As before, there is some constant overhead associated with fetching from memory-mapped files instead of main memory. All benchmarks were run on a 2.1 GHz server with 500 GB of RAM. The maximum resident sizes for the benchmarks are included in the Supplementary Table S3.
Fig. 5.
Running time (in seconds) of searching for a query in panels containing 20 000 haplotypes with increasing number of sites: (A) while keeping the number of matches relatively constant (∼70 000 matches) and (B) with increasing number of matches linearly with the number of sites
Running time (in seconds) of searching for a query in panels containing 20 000 haplotypes with increasing number of sites: (A) while keeping the number of matches relatively constant (∼70 000 matches) and (B) with increasing number of matches linearly with the number of sites
3.2 Applying PBWT-Query and L-PBWT-Query on the UK Biobank data
PBWT-Query and L-PBWT-Query were tested on the UK Biobank data (487 409 participants and 974 818 haplotypes for autosomal chromosomes) (Bycroft ) to demonstrate their utility on real data. The UK Biobank dataset includes pairwise kinship coefficients between individuals computed using the KING toolset Manichaikul . According to the KING tutorial, the kinship coefficient ranges [0.177, 0.354], [0.0884, 0.177] and [0.0442, 0.0884] denote first-, second- and third-degree relationships, respectively. Here, a first-degree relationship refers to a parent–offspring or full-sibling relationship, a second-degree relationship includes half-siblings, avuncular pairs and grandparent–grandchild pairs, and a third-degree relationship includes the first-cousins Manichaikul .The aim is to investigate whether potential genetic relationships from the UK Biobank data can be inferred by searching for exact matches using PBWT-Query or L-PBWT-Query. Because true IBDs can be disrupted by mismatches, the sum of the lengths of L-long matches between a pair of individuals was tested as a potential signal to differentiate first-, second- and third-degree relationships from each other, and from a background population. Two hundred individuals (400 haplotypes) which had genetic relationships within the UK Biobank data were chosen as queries. As a representative sample of the background population, one thousand individuals were randomly chosen from the UK Biobank excluding the 200 query individuals. The two hundred individuals were queried against the entire UK Biobank data for exact matches of length ≥ 700 SNPs in Chromosomes 1 through 22. This match length cutoff was chosen by following the precedent of 23andMe, whose simulations show that first, second and third cousin relationships can be reliably detected by haplotype matches of length ≥ 700 SNPs and 7 cM across all chromosomes. If we consider the detection power as the percentage of related pairs with any matches of length ≥ 700 SNPs, there is a 100% detection power in detecting potential first- and second-degree relationships by searching only in Chromosome 1. The detection power for first, second and third-degree relationships using all autosomal chromosomes is 100%. The queries were run using PBWT-Query and L-PBWT-Query (Memory-Extensive) on a 2.5 GHz server with 15 TB of SSD and 6 TB of RAM. The average time for running L-PBWT-Query a single query (one haplotype) on Chromosomes 1 through 22 is 6 s and for PBTW-Query 20 s, discounting the time to load the L-PBWT-Query data structures into memory. The maximum resident set size for L-PBWT-Query was 4.7 TB, and 2.4 TB for PBWT-Query.There is a promising separation between the three degrees of relatedness and the background population. Figure 6A shows that the sum of the lengths of L-long matches between a pair of individuals is a capable signal to differentiate first-, second- and third-degree relationships from random pairs picked from the background population, and to filter potential first- second- and third-degree relationships for further processing. Figure 6B shows an example of detected identical segments in Chromosome 1 between an individual (A) and two of their relatives (B and C) in the UK Biobank data. The two haplotypes for each individual are distinguished by 1 and 2 (e.g. A has haplotypes A1 and A2). The total length of the segments shared with the first-degree relative is significantly larger than those shared with the third-degree relative. We computed the AUC values to differentiate three degrees of relationships using the sum of the lengths of the shared segments. The AUC value between first- and third-degree relationships is 0.98, and AUC value between first- and second-degree relationships is 0.96. There are a few outliers in the second-degree relationships; however, the AUC value between the second- and third-degree relationships is 90%. Thus, identical segments interrupted due to genotyping or phasing errors can be used to infer the related individuals, and to estimate the degree of relatedness between closely related individuals, without careful post-processing of the interrupted matches.
Fig. 6.
(A) Probability distributions of the sum of L-long matches (in cM) between the query and related individuals (first-, second- or third-degree), and random individuals in the UK Biobank data. Relatedness is computed using KING. (B) An example of detected identical segments in Chromosome 1 (in bps) for an individual with at least two relatives (first- and third-degree relatives) in the UK Biobank data
(A) Probability distributions of the sum of L-long matches (in cM) between the query and related individuals (first-, second- or third-degree), and random individuals in the UK Biobank data. Relatedness is computed using KING. (B) An example of detected identical segments in Chromosome 1 (in bps) for an individual with at least two relatives (first- and third-degree relatives) in the UK Biobank data
4 Discussion
In this work, we proposed two efficient algorithms, PBWT-Query and L-PBWT-Query, for finding all L-long matches between a query haplotype and a panel. The time complexity of L-PBWT-Query does not rely on the number of haplotypes in the panel, which enables on-line genealogical search in large cohorts. Furthermore, the memory-mapped version of the algorithm, L-PBWT-Query (Memory-Mapped), will facilitate fast search when extensive main memory is not available, in exchange for a slightly increased running time. PBWT-Query shows similar runtime to L-PBWT-Query as the number of matches remains constant with the increasing number of haplotypes. However, the difference in efficiency becomes more obvious with the increasing number of matches.L-PBWT-Query (Memory-Mapped) does not require loading the entire panel into the main memory in order to facilitate a fast search. The tradeoff, however, is an increased running time due to increased I/O operations. We demonstrated that the running time of the L-PBWT-Query (Memory-Mapped) version is slightly worse than L-PBWT-Query (Memory-Extensive), but L-PBWT-Query (Memory-Mapped) will be more practical if extensive memory resources are not available.We applied L-PBWT-Query on ∼500 k individuals from the UK Biobank data and were able to detect close relatives of query individuals. The running time for all autosomal chromosomes in the UK Biobank data would be only a few seconds using a single CPU. Our results show that very close relatives can be easily found by L-PBWT-Query. We also ran PBWT-Query on the UK Biobank data. Although the memory usage for PBWT-Query was lower than L-PBWT-Query, the run time was moderately worse. To further improve detection, consideration of haplotype phasing quality will be needed. Another limitation of our work is that we have been focusing on developing efficient algorithms for exact matches of a query in the panel. In order to enhance the application on real datasets and detect more distant relatives, the algorithms need to be made tolerant to genotyping errors or mutations that could have occurred in real data, e.g. by random projection (Naseri ).Click here for additional data file.
Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025
Authors: Alexander Gusev; Jennifer K Lowe; Markus Stoffel; Mark J Daly; David Altshuler; Jan L Breslow; Jeffrey M Friedman; Itsik Pe'er Journal: Genome Res Date: 2008-10-29 Impact factor: 9.043
Authors: William A Freyman; Kimberly F McManus; Suyash S Shringarpure; Ethan M Jewett; Katarzyna Bryc; Adam Auton Journal: Mol Biol Evol Date: 2021-05-04 Impact factor: 16.240