
Maximal Perfect Haplotype Blocks with Wildcards.

Lucia Williams, Brendan Mumey.

Abstract

Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding maximal perfect haplotype blocks from a set of genomic samples for which SNPs (single-nucleotide polymorphisms) have been called. Often, owing to low read coverage and imperfect assemblies, some of the SNP calls can be missing from some of the samples. In this work, we consider the problem of finding maximal perfect haplotype blocks where some missing values may be present. Missing values are treated as wildcards, and the definition of maximal perfect haplotype blocks is extended in a natural way. We provide an output-linear time algorithm to identify all such blocks and demonstrate the algorithm on a large population SNP dataset. Our software is publicly available.
Copyright © 2020 The Author(s). Published by Elsevier Inc. All rights reserved.

Keywords:  Bioinformatics; Genetics; Genomics

Year:  2020        PMID: 32446220      PMCID: PMC7243190          DOI: 10.1016/j.isci.2020.101149

Source DB:  PubMed          Journal:  iScience        ISSN: 2589-0042


Introduction

Several recent works have explored the problem of determining maximal perfect haplotype blocks (MPHBs) in large population datasets where each individual in the dataset is genotyped at a set of single-nucleotide polymorphism (SNP) sites along their genome. These data can be viewed as a large matrix, where the rows represent each individual or sample and the columns are the SNP locations along a chromosome of interest. Typically, a 0 indicates no change from the reference and a 1 represents a change. The study of algorithms to find MPHBs was motivated by their usefulness in determining the fitness of genetic variations for a set of loci along the genome by computing the selection coefficient (Gillespie, 2004, Chapter 5.3), which indicates how stable each locus is from an evolutionary standpoint. With an algorithm for computing MPHBs that scales to entire chromosomes and thousands of individual samples, it is possible to investigate the selective pressure at each locus over an entire chromosome. The first such algorithm runs in quadratic time (Cunha et al., 2018); more recent work (Alanko et al., 2020) gives two methods for finding MPHBs that both run in linear time. The first linear-time method is based on suffix trees, and the second is based on the positional Burrows-Wheeler transform (pBWT), both of which are data structures developed for efficient processing of string (alphabet character) data. In practice, the authors found the pBWT-based method to be the faster of the two. In this work, we extend the definition of maximal perfect haplotype blocks to include the possibility of missing values in the data. We model such missing values as wildcards. Sequencing data are often noisy and missing SNP calls are common, so it is natural to consider this variation of the problem.
Wildcards have also been considered in other pattern-matching contexts; e.g., researchers have considered an indexing scheme for a text containing wildcards and a query pattern without wildcards (Tam et al., 2009) and indexing a text without wildcards to search for patterns with wildcards (Bille et al., 2014). We update the definition of MPHBs to permit some SNP values to be wildcards (represented as ∗'s in Figure 1). We assume that in each column there are samples for each SNP variation, i.e., there is at least one row with a 0 value and at least one row with a 1 value. This assumption guarantees that each block resolves to a single sequence of SNPs, which simplifies the representation of blocks. Following the notation from the paper that first introduced MPHBs (Cunha et al., 2018), we index SNPs (columns) as 1 through n and sequences (rows) as s_1 through s_k. For a set of sequences S and a subset K of the row indices {1, …, k}, we define S(K) as the K-induced subset of S. That is, S(K) = {s_r : r ∈ K}.
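To make the representation concrete, here is a minimal sketch (our own illustration, not the authors' Java code) of the SNP matrix with wildcards and the K-induced subset S(K). We use `None` to stand in for the wildcard ∗ and 0-indexed rows:

```python
# Rows are samples, columns are SNP sites:
# 0 = same as reference, 1 = variant, None stands in for the wildcard '*'.
S = [
    [0, None, 1],   # sample 1
    [None, 0, 1],   # sample 2
    [0, 0, None],   # sample 3
    [1, None, 0],   # sample 4
]

def induced_subset(S, K):
    """S(K): the K-induced subset of the sequences (the rows indexed by K)."""
    return [S[r] for r in sorted(K)]

# S({1, 3}) in 1-indexed paper notation corresponds to rows {0, 2} here.
print(induced_subset(S, {0, 2}))
```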
Figure 1

Six Sequences with Three SNPs Each

There can be many possible maximal perfect wildcard haplotype blocks if many wildcards are present; in this example, there are 22 possible blocks.

Given k sequences s_1, …, s_k, each of length n, a maximal perfect wildcard haplotype block is a triple (K, i, j) with K ⊆ {1, …, k}, |K| ≥ 2, and 1 ≤ i ≤ j ≤ n such that: (1) for every column c with i ≤ c ≤ j, the sequences in S(K) do not contain both a 0 and a 1 at position c (consensus); (2) i = 1 or column i − 1 of S(K) contains both a 0 and a 1 (left-maximality); (3) j = n or column j + 1 of S(K) contains both a 0 and a 1 (right-maximality); and (4) for every r ∉ K, the sequence s_r disagrees with some sequence in S(K) at some position between i and j (row-maximality). Figure 1 shows a set of six sequences with three SNPs each. There, one of the ordered triples is a maximal perfect wildcard haplotype block because it satisfies all four of the above conditions: at every column c with i ≤ c ≤ j, the sequences included in the block of rows contain either a 0 or a 1 but do not contain both; i = 1, so the block cannot be extended left; among the included sequences, column j + 1 contains at least one 0 and at least one 1, so the block cannot be extended right; and every sequence not included in the block disagrees with a sequence included in the block at at least one position, so the block cannot contain any additional rows. By contrast, a triple whose rows contain both a 0 and a 1 in some column of its range is not an MPWHB, as this violates the first (consensus) property. The maximal perfect wildcard haplotype block (MPWHB) problem is to find all maximal perfect wildcard haplotype blocks for a given set of sequences. In previous work (Cunha et al., 2018), it was shown that in the original formulation of the problem without wildcards, there are at most k possible maximal perfect haplotype blocks that end at each column in the dataset; thus there are at most nk maximal perfect haplotype blocks in total. The number of maximal haplotype blocks grows considerably if wildcards are present. The example shown in Figure 1 illustrates a simple construction in which, for each column (SNP position), there is exactly one row with a 0 value and one row with a 1 value. In general, this construction has n SNPs and 2n sequences, with s_{2c−1}[c] = 0 and s_{2c}[c] = 1 for 1 ≤ c ≤ n, and all other values set to be wildcards.
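The four conditions can be checked directly by brute force. The sketch below is our own illustrative Python (not the authors' implementation; `WILD` and `is_mpwhb` are names we chose), with 0-indexed columns rather than the paper's 1-indexing:

```python
WILD = None  # stands in for the wildcard '*'

def _column_vals(S, K, c):
    """Concrete (non-wildcard) values appearing in column c of S(K)."""
    return {S[r][c] for r in K} - {WILD}

def _consensus(S, K, i, j):
    """Consensus: no column in i..j of S(K) contains both a 0 and a 1."""
    return all(len(_column_vals(S, K, c)) < 2 for c in range(i, j + 1))

def is_mpwhb(S, K, i, j):
    """Check the four defining conditions for the triple (K, i, j)."""
    n, K = len(S[0]), set(K)
    if len(K) < 2 or not _consensus(S, K, i, j):
        return False
    if i > 0 and _consensus(S, K, i - 1, j):       # not left-maximal
        return False
    if j < n - 1 and _consensus(S, K, i, j + 1):   # not right-maximal
        return False
    # row-maximality: adding any excluded row must break consensus
    return all(not _consensus(S, K | {r}, i, j)
               for r in range(len(S)) if r not in K)
```

On the Figure 1 construction (n = 3 SNPs, 2n = 6 sequences), exhaustively testing every candidate triple with this checker recovers exactly the 22 blocks noted in the caption.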
It is easy to check that, for any 1 ≤ i ≤ j ≤ n and any binary string t of length j − i + 1, there is a subset K of the sequences such that (K, i, j) forms an MPWHB that agrees with t in all non-wildcard positions. Summing over the possible block lengths l, the total number of MPWHBs is Σ_{l=1}^{n} (n − l + 1)·2^l, as there are n − l + 1 starting positions for a block of length l and 2^l possible agreed-upon strings. In practice, we expect that there will be many fewer solutions in normal biological inputs. Motivated by this, we propose the objective of finding an algorithm whose running time is linear in the number of blocks found. In the remainder of the paper, we refer to MPWHBs simply as blocks for convenience. In the Supplemental Information Transparent Methods Section S, we discuss how to represent all blocks using a trie data structure and describe an algorithm for finding all solutions using depth-first search to implicitly search the trie. A trie (pronounced "try") is a tree in which each node is labeled with a single letter from an alphabet; the words present in a collection can be retrieved by following a path from the root (top) of the tree to a terminal (leaf) node. For us, the "words" in the trie are representations of the blocks in the set of sequences. Depth-first search is a standard method for traversing a tree data structure that we can use to explore the entire trie systematically and efficiently. Figure S.1 shows an example of the trie data structure built over a small dataset; Algorithm 1 describes the algorithm. The running time of the algorithm is O(nkb), where b is the total number of blocks found.
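A toy version of this search can be written directly: start a DFS at each column with every row as a candidate, branch whenever the next column contains both a 0 and a 1 among the surviving rows, and report a block whenever the maximality conditions hold. The following is our own Python sketch of the idea, 0-indexed, with explicit re-checks for simplicity (the authors' Algorithm 1 instead maintains a cons.cnt vector, and their implementation is in Java):

```python
WILD = None  # stands in for the wildcard '*'

def enumerate_blocks(S):
    """Enumerate maximal perfect wildcard haplotype blocks (K, i, j),
    0-indexed, via depth-first search over the implicit trie of
    consensus strings."""
    k, n = len(S), len(S[0])
    blocks = set()

    def vals(K, c):
        return {S[r][c] for r in K} - {WILD}

    def consensus(K, i, j):
        return all(len(vals(K, c)) < 2 for c in range(i, j + 1))

    def extend(K, i, j):
        # Invariant: K is the set of all rows consistent with the chosen
        # consensus string on columns i..j.
        if len(K) < 2:
            return
        conflict = j + 1 < n and len(vals(K, j + 1)) == 2
        if (j + 1 == n or conflict) and (i == 0 or len(vals(K, i - 1)) == 2):
            # right- and left-maximal; confirm row-maximality before reporting
            if all(not consensus(K | {r}, i, j) for r in range(k) if r not in K):
                blocks.add((frozenset(K), i, j))
        if j + 1 == n:
            return
        if conflict:  # branch the trie on the two possible consensus values
            for v in (0, 1):
                extend({r for r in K if S[r][j + 1] in (v, WILD)}, i, j + 1)
        else:
            extend(K, i, j + 1)

    for i in range(n):  # one trie search per possible left endpoint
        K = set(range(k))
        if len(vals(K, i)) == 2:
            for v in (0, 1):
                extend({r for r in K if S[r][i] in (v, WILD)}, i, i)
        else:
            extend(K, i, i)
    return blocks
```

On the Figure 1 construction with n = 3 and 2n = 6 sequences, this reports exactly 22 blocks, matching the Σ_{l=1}^{n} (n − l + 1)·2^l count for n = 3.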

Results

We tested our method for finding blocks using the human chromosome 22 dataset from the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2015) that was also used by others studying MPHBs (Alanko et al., 2020). This was the smallest dataset they looked at, but it still consists of 5,008 sequences and 1,055,454 SNPs and so is representative of current larger SNP population datasets. Our implementation, written in Java, employs a built-in parallel loop construct for multi-threading. The algorithm was run on a high-performance research cluster (Hyalite High Performance Computing System, operated and supported by University Information Technology Research Cyberinfrastructure at Montana State University), and each experimental run was done on a single Intel Xeon Ivy Bridge or similar node with multiple cores available and 256 GB of memory. Table 1 summarizes the results as we varied both the percentage of random wildcards (defined as the probability that a specific SNP call is a ∗) and the minimum block area threshold. The area of a block (K, i, j) is defined as |K|·(j − i + 1), the number of samples in the block times the width of the block in terms of the number of SNPs. The minimum block area is given as a parameter to the algorithm and is also used to prune the trie search; paths that cannot lead to a block that meets the minimum area requirement are not explored. We did not optimize I/O, so we only start timing each run after the SNP input file has been read into memory (which takes several minutes). We note that, although our experiments follow the general form of those performed in previous work (Alanko et al., 2020), the running times presented here are not directly comparable with those reported there, since the implementations are written in different languages and the experiments were performed on different systems.
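The pruning test itself reduces to a simple area bound. The helper below is a hypothetical illustration (the name, signature, and exact form are ours, not taken from the authors' code): a DFS state whose block starts at column i with |K| surviving rows can never reach an area above |K|·(n − i), because rows are only lost as the block extends rightward.

```python
def can_meet_area(num_rows, start_col, n, min_area):
    """Prune test: can a block starting at column start_col (0-indexed),
    currently with num_rows surviving rows, still reach min_area?
    The row set only shrinks as the block widens, and the block width
    is at most n - start_col."""
    return num_rows * (n - start_col) >= min_area
```

A search path failing this test is abandoned, which is why the runs with larger minimum areas in Table 1 visit far fewer states per reported block than a naive search would.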
It is possible to optimize the DFS function to receive only the indices of the rows that are lost at each call, meaning that Algorithm 1 takes O(n + k) space, accounting for the storage of up to k lost-row indices along the currently explored trie path and the length-n cons.cnt vector. However, the algorithm reads the entire k-by-n matrix into RAM, and this dominates memory use. For this reason, we did not record memory usage in our experiments.
Table 1

Summary of Experiments Varying the Wildcard Frequency and Minimum Block Area for an SNP Dataset for Human Chromosome 22 Consisting of 5,008 Sequences and 1,055,454 SNPs

Wildcards   Min. Area   Run Time      # Blocks        # DFS Calls/Block   |K|       # SNP
0%          –           13 min 28 s   148,613,645     35.5                1,497.3   490.0
0%          500,000     19 min 37 s   16,076,453      294.5               1,498.5   690.4
0%          1,000,000   18 min 11 s   2,228,762       1,888.4             1,659.1   889.2
0%          2,000,000   13 min 47 s   4,779           660,363.0           1,634.9   1,287.9
5%          –           2 h 22 min    506,675,436     30.8                545.9     426.1
5%          500,000     2 h 12 min    18,155,762      815.9               1,477.3   710.9
5%          1,000,000   1 h 47 min    2,652,944       5,277.8             1,645.0   909.8
5%          2,000,000   2 h 9 min     13,387          926,786.2           1,546.8   1,380.9
10%         –           5 h 32 min    1,128,831,659   27.3                334.4     439.9
10%         500,000     4 h 18 min    20,144,453      1,471.3             1,455.3   736.5
10%         1,000,000   5 h 21 min    3,101,221       9,157.3             1,627.0   1,627.0
10%         2,000,000   5 h 3 min     36,145          721,431.8           1,506.1   1,452.0

The rightmost three columns are averages.

Although block statistics vary significantly when there is no minimum area threshold, we see that the distributions of the larger blocks are not drastically different when wildcards are present. We also plot the distributions of block shapes found for the varying wildcard rates and a minimum block area of 500,000 in Figure 2.
Figure 2

Scatterplots Showing the Distributions of Maximal Perfect Wildcard Haplotype Block Shapes Found for the Different Wildcard Rates and the Minimum Block Area Threshold Set to 500,000

Discussion

In this work, we give the first method for computing maximal perfect haplotype blocks in the presence of missing data. To do so, we define the maximal perfect wildcard haplotype block problem and give an output-linear time algorithm to solve it, with a running time of O(nkb), where b is the total number of blocks found. Although it is possible to find maximal perfect haplotype blocks (without wildcards) in O(nk) time (Alanko et al., 2020), that method cannot be directly adapted to the wildcard setting, because rows with wildcards cannot be sorted. It remains open whether a faster algorithm can be given in the wildcard case. Our experimental results suggest that randomly distributed missing values do not substantially alter the distribution of larger blocks. Thus, for applications such as estimating the selection coefficients of loci along a genome, missing SNP values should be tolerable. We plan to investigate this further on synthetic and actual datasets with missing values. In recent work (Williams and Mumey, 2020), we examined a version of the maximal perfect haplotype block problem for pangenomic data. It would also be natural to consider missing values in those data; this would lead to a version of the problem where the sequences are paths in a graph and a wildcard could indicate that a path goes through an SNP site but the value could not be called.

Limitations of the Study

We focused on updating the theory around maximal perfect haplotype blocks to account for unknown SNPs and on giving an algorithm to find all blocks with wildcards. The main application of finding blocks has been to compute selection coefficients, so an important next step is to develop a method and a tool to compute selection coefficients in the presence of unknown SNPs. In our experiments, we used only one human chromosome; future work could study a larger set of human chromosomes or datasets from other organisms. Additionally, we looked only at randomly distributed unknown values; real data may have unknown SNPs distributed in a non-random fashion that could skew the distributions of large blocks. We proposed an algorithm for finding all blocks with an asymptotic runtime of O(nk) time per block found; an interesting algorithmic question is whether blocks can be found asymptotically faster than this.

Resource Availability

Lead Contact

Requests for further information should be directed to and will be fulfilled by the Lead Contact, Lucia Williams (lucia.williams@montana.edu).

Materials Availability

No materials were generated in this study.

Data and Code Availability

VCF files from phase three of the 1000 Genomes Project are available from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. We used chromosome 22 and processed it to a binary haplotype matrix using the program vcf2bm, available at https://gitlab.com/bacazaux/haploblocks. Code to run our algorithm and experiments on a binary matrix is available at github.com/msu-alglab/WildHap/.

Methods

All methods can be found in the accompanying Transparent Methods supplemental file.

1.  A global reference for human genetic variation.

Authors:  Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal:  Nature       Date:  2015-10-01       Impact factor: 49.962

2.  Finding all maximal perfect haplotype blocks in linear time.

Authors:  Jarno Alanko; Hideo Bannai; Bastien Cazaux; Pierre Peterlongo; Jens Stoye
Journal:  Algorithms Mol Biol       Date:  2020-02-10       Impact factor: 1.405
