| Literature DB >> 32055252 |
Jarno Alanko1, Hideo Bannai2, Bastien Cazaux1, Pierre Peterlongo3, Jens Stoye4.
Abstract
Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.Entities:
Keywords: Haplotype block; Population genomics; Positional Burrows–Wheeler Transform; Selection coefficient
Year: 2020 PMID: 32055252 PMCID: PMC7008532 DOI: 10.1186/s13015-020-0163-6
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1Illustration of Definition 1. A binary haplotype matrix with three maximal perfect haplotype blocks , and highlighted. (The example contains additional maximal perfect haplotype blocks that are not shown.)
Fig. 2Available blocks. Left: an example of a haplotype matrix up to column 6 with the two arrays and on the right. Center: the colexicographically sorted rows and the array listed on the right. Right: the trie of the reverses of the rows of the matrix. For example, the block is available because , , , is the consecutive range , we have for all with , and we have and . The repeat in the block is 00, and we see it is a branching node in the trie on the right
Running times and memory usage of our pBWT-based implementation
| Data set | #lines | #columns | Min block size | Time | Memory (MB) | #blocks |
|---|---|---|---|---|---|---|
| chr. 22 | 5008 | 1,055,454 | 4 min 54 s | 12.8 | 148,613,645 | |
| chr. 22 | 5008 | 1,055,454 | 500,000 | 3 min 50 s | 12.8 | 16,076,453 |
| chr. 22 | 5008 | 1,055,454 | 1,000,000 | 3 min 40 s | 12.8 | 2,228,762 |
| chr. 22 | 5008 | 1,055,454 | 2,000,000 | 3 min 43 s | 12.8 | 4779 |
| chr. 6 | 5008 | 4,800,101 | 19 min 42 s | 12.8 | 624,689,548 | |
| chr. 6 | 5008 | 4,800,101 | 500,000 | 17 min 20 s | 12.8 | 89,840,467 |
| chr. 6 | 5008 | 4,800,101 | 1,000,000 | 16 min 30 s | 12.8 | 11,388,982 |
| chr. 6 | 5008 | 4,800,101 | 2,000,000 | 16 min 36 s | 12.8 | 5585 |
| chr. 2 | 5008 | 6,786,300 | 31 min 57 s | 12.8 | 946,717,897 | |
| chr. 2 | 5008 | 6,786,300 | 500,000 | 25 min 06 s | 12.8 | 160,094,115 |
| chr. 2 | 5008 | 6,786,300 | 1,000,000 | 23 min 24 s | 12.8 | 25,533,314 |
| chr. 2 | 5008 | 6,786,300 | 2,000,000 | 23 min 18 s | 12.8 | 120,243 |
Note that in our streaming implementation the memory usage is dominated by the number of haplotypes times the buffer size, and therefore is essentially constant in this study
Comparison of the trie-based implementation from [8] and our pBWT-based implementation with minimum block size
| Data set | trie | pBWT | ||
|---|---|---|---|---|
| Time | Memory | Time | Memory (MB) | |
| chr. 22 | 17 min 08 s | 927.8 MB | 3 min 40 s | 12.8 |
| chr. 6 | 1 h 34 min 34 s | 3.23 GB | 16 min 30 s | 12.8 |
| chr. 2 | 2 h 07 min 21 s | 4.46 GB | 23 min 24 s | 12.8 |
Fig. 3Selection scan for human chromosome 2. Shown is for each position of the chromosome the largest maximum likelihood estimate derived from any maximal perfect haplotype block overlapping that locus. It is easy to spot potential regions of high selection. The centromere, located around 93 Mbp, shows no signal as sequencing coverage is low here and no SNPs could be called