Sven D Schrinner, Rebecca Serra Mari, Jana Ebler, Mikko Rautiainen, Lancelot Seillier, Julia J Reimer, Björn Usadel, Tobias Marschall, Gunnar W Klau.
Abstract
Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WHATSHAP POLYPHASE, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state of the art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap.
Keywords: Cluster editing; Haplotypes; High-throughput nucleotide sequencing; Phasing; Plant science; Polyploidy; Sequence analysis
Year: 2020 PMID: 32951599 PMCID: PMC7504856 DOI: 10.1186/s13059-020-02158-1
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 The MEC model in collapsed regions. Frequently, in polyploid organisms, two or more haplotypes are locally identical on larger stretches of sites, as shown by the pink and blue haplotype sequences in the left picture. The MEC model favors assigning the reads of two haplotypes to only one partition, because the spare partition can be used to collect noisy reads, which gives a lower MEC score but also results in unbalanced and likely wrong partitions
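The effect described in this caption can be reproduced on a toy instance. The following is a minimal sketch, assuming the usual definition of the MEC score (for each partition and site, the number of read alleles that disagree with the partition's majority allele). The function name `mec_score`, the helper `read`, and the toy reads are illustrative assumptions and are not taken from WhatsHap.

```python
from collections import Counter

def read(alleles):
    """Turn a string such as '010' into a read: a dict mapping site index to allele."""
    return {site: int(allele) for site, allele in enumerate(alleles)}

def mec_score(partition):
    """Count alleles that disagree with the majority allele of their partition, per site."""
    total = 0
    for reads in partition:
        sites = {site for r in reads for site in r}
        for site in sites:
            counts = Counter(r[site] for r in reads if site in r)
            total += sum(counts.values()) - max(counts.values())
    return total

# Toy triploid region (hypothetical data): haplotypes 1 and 2 are both 000,
# haplotype 3 is 111. Each of the two identical haplotypes has one read
# carrying the same error at the middle site.
true_partition = [
    [read("000"), read("000"), read("010")],  # haplotype 1 plus one noisy read
    [read("000"), read("000"), read("010")],  # haplotype 2 plus one noisy read
    [read("111"), read("111")],               # haplotype 3
]
collapsed_partition = [
    [read("000"), read("000"), read("000"), read("000")],  # haplotypes 1 and 2 merged
    [read("010"), read("010")],                            # spare partition collects the noise
    [read("111"), read("111")],
]

true_score = mec_score(true_partition)
collapsed_score = mec_score(collapsed_partition)
```

In this toy case the collapsed, unbalanced partition scores lower than the true one, which is the behavior the caption attributes to the MEC model.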
Fig. 2 Overview of WHATSHAP POLYPHASE. The input allele matrix results from a given BAM and VCF file and an optional realignment step. Phase I: statistical scoring of each read pair classifies them into belonging to the same or to different haplotypes. The scores are used as weights for a graph over all reads, which is clustered by cluster editing (gray round shapes). Phase II threads k haplotypes (colored lines) through the clusters (here k=4) balancing coverage violations and switch costs while respecting the genotype information. This results in k phased haplotypes, subdivided into blocks (vertical lines)
Fig. 3 N50 block lengths and the respective block-wise switch error rates for different block cut strategies of WHATSHAP POLYPHASE (default strategy marked by a circle) on the real tetraploid read dataset (top) and the simulated tetraploid dataset (bottom) with 40× and 80× coverage
Comparison of WHATSHAP POLYPHASE and H-POPG on tetraploid real (a) and simulated (b) datasets, pentaploid simulated dataset (c), and hexaploid simulated dataset (d). Performance is measured by the switch error rate (SER), the block-wise Hamming rate (HR), and the N50 of the block size. For better comparability with H-POPG, a second setting (WH-PP*) with fewer block cuts was used. The total length of the chromosome is 249 Mb
| Coverage | Method | SER (%) | HR (%) | N50 (bp) | Runtime (s) | Memory (GB) |
|---|---|---|---|---|---|---|
| (a) Real tetraploid read data | | | | | | |
| 40× | WH-PP | 0.58 | 1.48 | 29,529 | 3333 | 1.41 |
| | WH-PP* | 1.39 | 28.72 | 1,692,352 | 3433 | 1.42 |
| | H-POPG | 2.01 | 27.53 | 1,785,293 | 2230 | 9.97 |
| 80× | WH-PP | 0.31 | 1.43 | 54,434 | 12694 | 2.52 |
| | WH-PP* | 0.74 | 28.27 | 2,587,104 | 13042 | 2.89 |
| | H-POPG | 1.24 | 27.66 | 2,587,104 | 4368 | 9.99 |
| (b) Simulated tetraploid read data | | | | | | |
| 40× | WH-PP | 0.42 | 1.74 | 48,815 | 1960 | 1.10 |
| | WH-PP* | 1.00 | 26.57 | 1,830,943 | 2004 | 1.17 |
| | H-POPG | 1.67 | 26.37 | 1,917,094 | 1414 | 9.96 |
| 80× | WH-PP | 0.29 | 2.51 | 86,227 | 5738 | 1.78 |
| | WH-PP* | 0.68 | 25.23 | 2,142,893 | 5865 | 2.04 |
| | H-POPG | 0.98 | 25.65 | 2,142,893 | 2843 | 9.97 |
| (c) Simulated pentaploid read data | | | | | | |
| 40× | WH-PP | 0.86 | 1.57 | 22,625 | 2331 | 1.05 |
| | WH-PP* | 2.01 | 25.34 | 1,361,459 | 2377 | 1.07 |
| | H-POPG | 3.50 | 24.78 | 1,453,040 | 2357 | 9.97 |
| 80× | WH-PP | 0.47 | 1.18 | 33,438 | 5031 | 1.69 |
| | WH-PP* | 1.33 | 23.64 | 1,701,753 | 5118 | 1.87 |
| | H-POPG | 2.24 | 24.76 | 1,748,404 | 4849 | 9.96 |
| (d) Simulated hexaploid read data | | | | | | |
| 40× | WH-PP | 1.12 | 1.82 | 16,785 | 25841 | 1.30 |
| | WH-PP* | 2.35 | 27.03 | 3,877,456 | 25860 | 1.79 |
| | H-POPG | 3.85 | 26.75 | 4,490,129 | 5450 | 9.96 |
| 80× | WH-PP | 0.48 | 0.97 | 26,711 | 10331 | 1.98 |
| | WH-PP* | 1.34 | 25.63 | 4,540,968 | 10827 | 2.63 |
| | H-POPG | 2.37 | 25.93 | 4,721,421 | 11563 | 10.89 |
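For readers unfamiliar with the N50 column, the following is a minimal sketch of how an N50 block length can be computed. It assumes the common definition (the largest length L such that blocks of length at least L together cover at least half of a reference total). Whether the values in the table are normalized by the summed block length or by the chromosome length is not stated here, so the `total` parameter is an assumption that lets either convention be used; the function name `n50` is illustrative.

```python
def n50(block_lengths, total=None):
    """Return the N50 of the given block lengths.

    total: length to take half of; defaults to the sum of all block lengths
    (assumption: pass a chromosome length here for a genome-normalized value).
    Returns 0 if the blocks never reach half of `total`.
    """
    if total is None:
        total = sum(block_lengths)
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

# Hypothetical block lengths in bp, not taken from the datasets above.
example_blocks = [2, 3, 4, 5, 6]
example_n50 = n50(example_blocks)
```

With the default normalization the two longest blocks (6 and 5) already cover half of the total of 20, so the N50 of this toy input is 5. This also shows why a setting with fewer block cuts raises N50 sharply: merging blocks moves mass into a few long entries.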
Comparison of the switch error rates of WHATSHAP POLYPHASE (WH-PP) and H-POPG in collapsing regions spanning at least 50 variants, in non-collapsing regions, and on average across the genome. Results (switch error rates in %) are presented for Chromosome 1 of the real and simulated tetraploid datasets at both 40× and 80× coverage. The third row of each group gives the quotient of the switch error rate of H-POPG and that of WHATSHAP POLYPHASE to highlight by what factor the results differ
| Coverage | Method | Collapsing regions | Non-collapsing regions | Total |
|---|---|---|---|---|
| (a) Real read data (tetraploid) | | | | |
| 40× | WH-PP | 0.29 | 0.69 | 0.60 |
| | H-POPG | 2.02 | 2.16 | 2.02 |
| | Quotient (H-POPG / WH-PP) | 6.97 | 3.13 | 3.37 |
| 80× | WH-PP | 0.14 | 0.46 | 0.35 |
| | H-POPG | 1.05 | 1.30 | 1.24 |
| | Quotient (H-POPG / WH-PP) | 7.50 | 2.83 | 3.54 |
| (b) Simulated read data (tetraploid) | | | | |
| 40× | WH-PP | 0.18 | 0.45 | 0.43 |
| | H-POPG | 2.01 | 1.63 | 1.68 |
| | Quotient (H-POPG / WH-PP) | 11.17 | 3.62 | 3.91 |
| 80× | WH-PP | 0.08 | 0.37 | 0.32 |
| | H-POPG | 0.94 | 0.98 | 0.99 |
| | Quotient (H-POPG / WH-PP) | 11.75 | 2.65 | 3.09 |
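The third row of each coverage group can be re-derived from the two rows above it. The short sketch below does this for all four groups, using only the switch error rates printed in the table; the dictionary layout and the name `quotients` are illustrative choices.

```python
# Switch error rates (%) copied from the table, ordered as
# (collapsing regions, non-collapsing regions, total).
ser = {
    ("real", 40): {"WH-PP": (0.29, 0.69, 0.60), "H-POPG": (2.02, 2.16, 2.02)},
    ("real", 80): {"WH-PP": (0.14, 0.46, 0.35), "H-POPG": (1.05, 1.30, 1.24)},
    ("simulated", 40): {"WH-PP": (0.18, 0.45, 0.43), "H-POPG": (2.01, 1.63, 1.68)},
    ("simulated", 80): {"WH-PP": (0.08, 0.37, 0.32), "H-POPG": (0.94, 0.98, 0.99)},
}

def quotients(group):
    """Divide each H-POPG rate by the matching WH-PP rate, rounded to two decimals."""
    return tuple(round(h / w, 2) for h, w in zip(group["H-POPG"], group["WH-PP"]))

for (dataset, coverage), group in ser.items():
    print(dataset, f"{coverage}x", quotients(group))
```

In every group the quotient is largest in the collapsing-regions column, which is the column the comparison is meant to highlight.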
Fig. 4 Phasing of potato genome. a Per-base coverage distribution of Illumina and ONT MinION alignments on Chr01. b Fraction of phased variants in relation to gene length. The x-axis shows the gene length and the y-axis the percentage of phased variants in the longest block. Axis histograms and hexagons illustrate the distribution of data points. c IGV [23] screenshot showing alignments of uncorrected (top) and corrected MinION reads (bottom) of FRIGIDA-like protein 5 isoform X2 gene on Chr04. The corrected reads are colored (red, green, blue, purple) according to the haplotypes WHATSHAP POLYPHASE assigned them to. d Multiple sequence alignment of the ORFs detected in the four haplotype sequences. The uppermost gray sequence represents the reference, and the others correspond to the four haplotypes (same order as in panel c)
Fig. 5 Cluster editing example. The input graph on the left contains one node per read, with positively weighted edges (blue) for similar reads and negatively weighted edges (pink) for dissimilar reads. All other edges are zero-edges and are not drawn for the sake of simplicity. The model considers blue edges as present edges and pink edges as missing edges, as shown in the second graph. The information of the pink edges is still used as insertion cost for missing edges. The third graph shows, as dashed edges, the operations needed to obtain a clique graph: the dashed blue edges need to be deleted, and the dashed pink edges need to be inserted. The final clique graph is shown on the right
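The cost model in this caption can be written down in a few lines. The following is a minimal sketch, assuming the standard weighted cluster editing objective: deleting a present (positive) edge between two clusters costs its weight, and inserting a missing (negative) edge inside a cluster costs the absolute value of its weight. It only evaluates a given clustering and does not solve the optimization problem; the function name `editing_cost` and the toy graph are hypothetical.

```python
def editing_cost(weights, clusters):
    """Cost of turning the weighted graph into the clique graph given by `clusters`.

    weights: dict mapping a pair of reads (u, v) to a signed similarity weight;
             pairs that are not listed are zero-edges and never cost anything.
    clusters: list of sets of reads; every read must appear in exactly one set.
    """
    label = {node: i for i, cluster in enumerate(clusters) for node in cluster}
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = label[u] == label[v]
        if w > 0 and not same_cluster:
            cost += w    # present edge across clusters: must be deleted
        elif w < 0 and same_cluster:
            cost += -w   # missing edge inside a cluster: must be inserted
    return cost

# Hypothetical read graph: positive weights link similar reads,
# negative weights separate dissimilar ones.
weights = {
    ("a", "b"): 3.0, ("b", "c"): 2.0, ("a", "c"): -1.0,
    ("c", "d"): 1.0, ("d", "e"): 4.0, ("c", "e"): -2.0,
}
cost_abc_de = editing_cost(weights, [{"a", "b", "c"}, {"d", "e"}])
cost_ab_cde = editing_cost(weights, [{"a", "b"}, {"c", "d", "e"}])
```

Here the clustering {a, b, c}, {d, e} needs one insertion (a-c) and one deletion (c-d), which is cheaper than the alternative, so a cluster editing solver would prefer it.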
Fig. 6 Visualization of the threading. a Clusters of reads are represented as gray shapes, with their horizontal span indicating the covered variants and their height the respective coverage. The k=4 threads are shown as colored lines passing through the clusters. Multiple threads can enter the same cluster together if its coverage permits. b Alternative threading with the same score in our model. Two positions cause ambiguity and allow switches in the threading compared to a. These are candidate cut positions to prevent switch errors in the final phasing
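The idea of threading can be illustrated with a small dynamic program. The sketch below is a deliberately simplified stand-in and not the WhatsHap implementation: it ignores genotype information, treats a state as a multiset of k clusters per variant position, charges a coverage cost for deviating from each cluster's observed share of reads, and charges `switch_cost` for every thread that changes cluster between neighboring positions. All names, the cost functions, and the toy coverage values are assumptions.

```python
from collections import Counter
from itertools import combinations_with_replacement

def thread(coverage, k, switch_cost=1.0):
    """Pick one multiset of k clusters per position with minimal total cost.

    coverage: list over positions of dicts mapping cluster -> fraction of reads.
    Returns one sorted k-tuple of cluster names per position.
    """
    clusters = sorted({c for position in coverage for c in position})
    states = list(combinations_with_replacement(clusters, k))

    def coverage_cost(state, position):
        counts = Counter(state)
        return sum(abs(counts[c] / k - position.get(c, 0.0)) for c in clusters)

    def switches(a, b):
        # Threads that cannot stay in their cluster when moving from a to b.
        return k - sum((Counter(a) & Counter(b)).values())

    best = {s: coverage_cost(s, coverage[0]) for s in states}
    backpointers = []
    for position in coverage[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = min(states, key=lambda p: best[p] + switch_cost * switches(p, s))
            new_best[s] = (best[prev] + switch_cost * switches(prev, s)
                           + coverage_cost(s, position))
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace the cheapest final state back to the first position.
    state = min(best, key=best.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical tetraploid example: cluster A holds half of the reads throughout,
# so two threads share it; B and C end and are replaced by D at the last position.
toy_coverage = [
    {"A": 0.5, "B": 0.25, "C": 0.25},
    {"A": 0.5, "B": 0.25, "C": 0.25},
    {"A": 0.5, "D": 0.5},
]
toy_threading = thread(toy_coverage, k=4, switch_cost=0.1)
```

In the toy result two threads share cluster A at every position, mirroring how multiple threads can pass through one cluster, and the two threads that leave B and C both move into D. Because a multiset does not record which thread took which path, such positions are exactly where alternative threadings of equal score arise.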