Andrea Bracciali1, Marco Aldinucci2, Murray Patterson3, Tobias Marschall4,5, Nadia Pisanti6,7, Ivan Merelli8, Massimo Torquati6.
Abstract
BACKGROUND: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest given the current trend in sequencing technology towards longer fragments.
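The optimisation view described above can be illustrated with a minimal sketch. It assumes a weighted minimum error correction formulation: reads are split into two sets, and within each set every SNP column is made consistent by flipping the cheaper group of entries, where the weight of an entry stands for the confidence in that sequenced value. The function names, the `(allele, weight)` encoding, and the toy reads are hypothetical, and the exhaustive search is only a reference for tiny inputs, not the WHATSHAP algorithm.

```python
from itertools import product

def column_cost(entries):
    """Cost of making one column of one read set agree on a single allele."""
    w0 = sum(w for allele, w in entries if allele == 0)
    w1 = sum(w for allele, w in entries if allele == 1)
    return min(w0, w1)  # flip whichever allele group is cheaper

def wmec_bruteforce(fragments):
    """Exhaustive search over bipartitions of the reads (hypothetical helper).

    fragments: list of reads; each read holds, per SNP column, either
    None (not covered) or a tuple (allele, weight).
    Returns (minimum cost, assignment of each read to set 0 or 1).
    """
    n_reads = len(fragments)
    n_cols = len(fragments[0])
    best_cost, best_assignment = float("inf"), None
    # Read 0 is pinned to set 0 so mirror-image bipartitions are skipped.
    for rest in product((0, 1), repeat=n_reads - 1):
        assignment = (0,) + rest
        cost = 0
        for col in range(n_cols):
            for side in (0, 1):
                entries = [fragments[r][col] for r in range(n_reads)
                           if assignment[r] == side
                           and fragments[r][col] is not None]
                cost += column_cost(entries)
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment

# Toy input (made up): five reads over three SNP positions.
reads = [
    [(0, 5), (0, 5), None],
    [(0, 4), (0, 4), (1, 3)],
    [(1, 5), (1, 5), (0, 4)],
    [None,   (1, 3), (0, 2)],
    [(0, 1), (1, 6), (0, 5)],  # low-confidence entry in the first column
]
cost, assignment = wmec_bruteforce(reads)
```

The search visits 2^(n-1) bipartitions for n reads, which is why an exhaustive approach is only usable on toy inputs.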
Keywords: Future generation sequencing; Haplotyping; High-performance computing
Year: 2016 PMID: 28185544 PMCID: PMC5046197 DOI: 10.1186/s12859-016-1170-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
WHATSHAP profiling. Test for an input data sample with coverage 20 on a 2 CPU Xeon E5-2695 @2.4 GHz, 12-core x 2 context for each CPU, 64 GB RAM
| Coverage | <15 | 15 | 16 | 18 | 20 | 22 | 24 | 26 | 28 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| Time (ms) | <1 | 1.1 | 2.2 | 8.7 | 34.2 | 144.7 | 558.5 | 2352.7 | 9194.3 | 36622.7 |
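The growth of the profiled times with coverage can be checked with a few lines of arithmetic. The sketch below takes the values from the table above (the `<15` column is left out because it has no exact figure) and computes the per-unit-coverage growth factor between consecutive entries; the function name is a hypothetical helper.

```python
# Values copied from the profiling table (coverage, time in ms).
coverage = [15, 16, 18, 20, 22, 24, 26, 28, 30]
time_ms = [1.1, 2.2, 8.7, 34.2, 144.7, 558.5, 2352.7, 9194.3, 36622.7]

def per_unit_growth(cov, t):
    """Growth factor per single unit of coverage between consecutive entries."""
    factors = []
    for i in range(1, len(cov)):
        step = cov[i] - cov[i - 1]
        # Geometric interpolation: the ratio spread over `step` coverage units.
        factors.append((t[i] / t[i - 1]) ** (1 / step))
    return factors

factors = per_unit_growth(coverage, time_ms)
for c, f in zip(coverage[1:], factors):
    print(f"up to coverage {c}: x{f:.2f} per unit of coverage")
```

On these figures every factor stays close to 2, i.e. the measured time roughly doubles for each additional unit of coverage.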
Fig. 1 The fragment matrix F
Fig. 2 The cost matrix C(1,(R, S))
Fig. 3 The FastFlow skeleton used in PWHATSHAP. Each entity is a concurrent thread. The Emitter (S) produces and schedules tasks towards a pool of Workers (Ws). Each Worker sends results to the Reducer (R) and asks for new tasks from S
Fig. 4 Cost information about compatible partitionings between any two columns
Fig. 5 The cost matrix C(2,(R, S))
Fig. 6 Accuracy comparison amongst state-of-the-art toolkits. (P)WHATSHAP (first-left in the histograms) is top in minimising errors as well as in properly phasing, together with HapCol. Data extracted from [33]
Fig. 7 Accuracy as error rate for increasing coverages. The curves in the figure show how the accuracy scales up (error rate decreases) with larger coverages. Curves represent data for Venter’s Chromosome 1 and 15 with substitution error rate 1 and 5 %. From coverage 15 to coverage 25 the error rate decreases by about 40 %. Based on data extracted from [33]
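The Emitter, Workers and Reducer structure described in the caption of Fig. 3 can be sketched with standard Python threads and queues. This is only a structural analogue under stated assumptions: FastFlow is a C++ framework and none of its API is used here, the function `run_farm` and its parameters are hypothetical, and a shared task queue from which idle workers pull stands in for the on-demand scheduling described in the caption.

```python
import queue
import threading

def run_farm(tasks, work_fn, reduce_fn, initial, n_workers=4):
    """Emitter -> pool of workers -> reducer, each running as its own thread."""
    task_q = queue.Queue()
    result_q = queue.Queue()
    stop = object()  # unique end-of-stream marker

    def emitter():
        for task in tasks:
            task_q.put(task)
        for _ in range(n_workers):
            task_q.put(stop)  # one marker per worker

    def worker():
        while True:
            task = task_q.get()  # an idle worker asks for the next task
            if task is stop:
                result_q.put(stop)
                return
            result_q.put(work_fn(task))

    acc = [initial]

    def reducer():
        finished = 0
        while finished < n_workers:
            result = result_q.get()
            if result is stop:
                finished += 1
            else:
                acc[0] = reduce_fn(acc[0], result)

    threads = [threading.Thread(target=emitter), threading.Thread(target=reducer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return acc[0]

# Usage: square each task in the workers, keep the minimum in the reducer.
best = run_farm(range(1, 11), lambda x: x * x, min, float("inf"))
```

Because workers pull from a shared queue, a worker that finishes a cheap task immediately picks up the next one, which helps when task costs vary widely. In CPython, threads illustrate the structure only; CPU-bound work would need processes or native code to run in parallel.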
Overall speedup considered for the dataset filtered for different maximum coverage figures
| max cov. | Avg. time/col. (ms), WHATSHAP | Avg. time/col. (ms), PWHATSHAP | Speedup |
|---|---|---|---|
| 16 | 0.3 | 0.3 | 1.0 |
| 18 | 0.6 | 0.6 | 1.0 |
| 20 | 2.4 | 2.3 | 1.1 |
| 22 | 11.1 | 5.2 | 2.1 |
| 24 | 47.4 | 14.3 | 3.3 |
| 26 | 180.9 | 44.7 | 4.0 |
| 28 | 1462.5 | 287.9 | 5.0 |
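The speedup column can be related to the two time columns with a short calculation. The sketch below assumes that the first time column is the sequential run and the second the parallel run, and that speedup is their ratio; this reading is an assumption inferred from the figures, and the variable names are hypothetical.

```python
# Rows copied from the table: (max coverage, first time, second time, reported speedup).
rows = [
    (16, 0.3, 0.3, 1.0),
    (18, 0.6, 0.6, 1.0),
    (20, 2.4, 2.3, 1.1),
    (22, 11.1, 5.2, 2.1),
    (24, 47.4, 14.3, 3.3),
    (26, 180.9, 44.7, 4.0),
    (28, 1462.5, 287.9, 5.0),
]

# Speedup taken as (assumed) sequential time divided by (assumed) parallel time.
computed = [(cov, t_seq / t_par, reported) for cov, t_seq, t_par, reported in rows]

for cov, ratio, reported in computed:
    print(f"max cov. {cov}: ratio {ratio:.2f}, reported {reported}")
```

With the times as printed in the table, each ratio lands within about 0.1 of the reported speedup, the difference being attributable to rounding of the tabulated times.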
Speedup on columns with a specific coverage and % of dataset with the given coverage. Dataset is filtered for max coverage 28
| col. cov. | % of dataset | Avg. time/col. (ms), WHATSHAP | Avg. time/col. (ms), PWHATSHAP | Speedup |
|---|---|---|---|---|
| 16 | 2.0 % | 2.3 | 2.3 | 1.0 |
| 18 | 2.4 % | 9.0 | 9.0 | 1.0 |
| 20 | 2.5 % | 35.7 | 32.8 | 1.1 |
| 22 | 3.6 % | 153.1 | 41.4 | 3.6 |
| 24 | 3.2 % | 557.1 | 139.6 | 3.9 |
| 26 | 2.8 % | 2461.0 | 585.3 | 4.2 |
| 28 | 12.0 % | 9555.5 | 1175.5 | 5.3 |