Lucia Williams, Brendan Mumey.
Abstract
Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding maximal perfect haplotype blocks from a set of genomic samples for which SNPs (single-nucleotide polymorphisms) have been called. Often, owing to low read coverage and imperfect assemblies, some of the SNP calls can be missing from some of the samples. In this work, we consider the problem of finding maximal perfect haplotype blocks where some missing values may be present. Missing values are treated as wildcards, and the definition of maximal perfect haplotype blocks is extended in a natural way. We provide an output-linear time algorithm to identify all such blocks and demonstrate the algorithm on a large population SNP dataset. Our software is publicly available.
Keywords: Bioinformatics; Genetics; Genomics
Year: 2020 PMID: 32446220 PMCID: PMC7243190 DOI: 10.1016/j.isci.2020.101149
Source DB: PubMed Journal: iScience ISSN: 2589-0042
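The abstract says missing SNP calls are treated as wildcards but does not spell out the extended block definition. The sketch below illustrates one plausible reading: a set of sequences forms a perfect block over a range of SNP columns if, in every column, all non-wildcard alleles agree. The `*` symbol, the function names, and the toy data are assumptions made for illustration; this is not the authors' software, and it only checks a candidate block rather than enumerating maximal blocks in output-linear time.

```python
from typing import List, Sequence

WILDCARD = "*"  # assumed symbol for a missing SNP call


def column_is_compatible(column: Sequence[str]) -> bool:
    """Return True if all non-wildcard alleles in one SNP column agree."""
    observed = {allele for allele in column if allele != WILDCARD}
    return len(observed) <= 1


def is_perfect_wildcard_block(
    seqs: List[str], rows: Sequence[int], start: int, end: int
) -> bool:
    """Check whether the sequences indexed by `rows` can be made identical
    on SNP columns start..end (inclusive) by filling in wildcards."""
    return all(
        column_is_compatible([seqs[r][c] for r in rows])
        for c in range(start, end + 1)
    )


# Hypothetical toy data (not the sequences of Figure 1):
# four sequences with three biallelic SNPs each.
seqs = ["01*", "0*1", "110", "*11"]

# Sequences 0, 1, and 3 are compatible on all three SNPs.
print(is_perfect_wildcard_block(seqs, [0, 1, 3], 0, 2))
# Sequences 0 and 2 carry different called alleles at SNP 0.
print(is_perfect_wildcard_block(seqs, [0, 2], 0, 2))
```

Checking each column over the whole set, rather than comparing sequences pairwise, matters here: with wildcards, pairwise compatibility is not transitive, so two sequences that each match a third may still conflict with each other.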
Figure 1. Six Sequences with Three SNPs Each
There can be many possible maximal perfect wildcard haplotype blocks if many wildcards are present; in this example, there are 22 possible blocks, e.g., is one of the blocks.
Summary of Experiments Varying the Wildcard Frequency and Minimum Block Area for an SNP Dataset for Human Chromosome 22 Consisting of 5,008 Sequences and 1,055,454 SNPs
| Wildcards | Min. Area | Run Time | # Blocks | #DFS Calls/Block | #SNP | |
|---|---|---|---|---|---|---|
| 0% | | 13 min 28 s | 148,613,645 | 35.5 | 1,497.3 | 490.0 |
| 0% | 500,000 | 19 min 37 s | 16,076,453 | 294.5 | 1,498.5 | 690.4 |
| 0% | 1,000,000 | 18 min 11 s | 2,228,762 | 1,888.4 | 1,659.1 | 889.2 |
| 0% | 2,000,000 | 13 min 47 s | 4,779 | 660,363.0 | 1,634.9 | 1,287.9 |
| 5% | | 2 h 22 min | 506,675,436 | 30.8 | 545.9 | 426.1 |
| 5% | 500,000 | 2 h 12 min | 18,155,762 | 815.9 | 1,477.3 | 710.9 |
| 5% | 1,000,000 | 1 h 47 min | 2,652,944 | 5,277.8 | 1,645.0 | 909.8 |
| 5% | 2,000,000 | 2 h 9 min | 13,387 | 926,786.2 | 1,546.8 | 1,380.9 |
| 10% | | 5 h 32 min | 1,128,831,659 | 27.3 | 334.4 | 439.9 |
| 10% | 500,000 | 4 h 18 min | 20,144,453 | 1,471.3 | 1,455.3 | 736.5 |
| 10% | 1,000,000 | 5 h 21 min | 3,101,221 | 9,157.3 | 1,627.0 | 1,627.0 |
| 10% | 2,000,000 | 5 h 3 min | 36,145 | 721,431.8 | 1,506.1 | 1,452.0 |
The rightmost three columns are averages.
Figure 2. Scatterplots Showing the Distributions of Maximal Perfect Wildcard Haplotype Block Shapes Found for the Different Wildcard Rates and the Minimum Block Area Threshold Set to 500,000