Emily Berger, Deniz Yorukoglu, Lillian Zhang, Sarah K Nyquist, Alex K Shalek, Manolis Kellis, Ibrahim Numanagić, Bonnie Berger.
Abstract
Haplotype reconstruction of distant genetic variants remains an unsolved problem due to the short-read length of common sequencing data. Here, we introduce HapTree-X, a probabilistic framework that utilizes latent long-range information to reconstruct unspecified haplotypes in diploid and polyploid organisms. It introduces the observation that differential allele-specific expression can link genetic variants from the same physical chromosome, thus even enabling the use of reads that cover only individual variants. We demonstrate HapTree-X's feasibility on in-house sequenced Genome in a Bottle RNA-seq and various whole exome, genome, and 10X Genomics datasets. HapTree-X produces more complete phases (up to 25%), even in clinically important genes, and phases more variants than other methods while maintaining similar or higher accuracy and being up to 10× faster than other tools. The advantage of HapTree-X's ability to use multiple lines of evidence, as well as to phase polyploid genomes in a single integrative framework, substantially grows as the amount of diverse data increases.
Year: 2020 PMID: 32938926 PMCID: PMC7494856 DOI: 10.1038/s41467-020-18320-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 HapTree-X framework compared to read-based phasing.
Traditional whole-genome sequencing (WGS) based phasing methods (top panel) depend on sequence contiguity and thus require a pair of SNPs (in red) to be connected through a common read that overlaps both in order to be phased. RNA-seq reads provide longer distance phasing capability due to long introns in the genome that are spliced-out in the sequenced transcript fragments (middle panel), yet SNPs that are far apart within the transcript due to long homozygous exonic regions are still difficult to phase using RNA-seq reads. Our HapTree-X framework (lower panel) overcomes this limitation by integrating RNA-seq reads and differential allele-specific expression (DASE) available from the RNA-seq data into a single probabilistic framework for haplotype phasing. For genes that display differential haplotypic expression (DHE), the majority of alleles can be phased together to obtain a single haplotype block for the entire gene. Depending on the DHE and depth-coverage, DASE-based phasing performs accurate haplotype reconstruction, without requiring paired-end or long reads, maintaining or improving on accuracy independent of gene/exon lengths as long as differential haplotypic expression is consistent across the loci being phased.
Comparison of phasing quality for four different phasers: HapCUT, HapCUT2, phASER, and HapTree-X on 9 RNA-seq datasets with varying transcriptomic coverage and on four different RNA-seq datasets combined with the NA12878 exome dataset.
| HapCUT | HapCUT2 | phASER (a) | HapTree-X | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GIAB (low coverage) | ||||||||||||
| NA12878 | N/A | N/A | 3,238 | 3387 | 1.98 | |||||||
| NA24143 | N/A | N/A | 2,399 | 3114 | ||||||||
| NA24149 | 6,696 | 9,322 | 6,677 | 9,306 | 2,984 | 1.37 | 3125 | |||||
| NA24385 | 7,079 | 1.75 | 8,100 | 7,055 | 1.86 | 7,971 | 3,896 | 3713 | 1.64 | |||
| NA24631 | 7,888 | 10,355 | 7,866 | 10,026 | 3,919 | 6303 | ||||||
| K562 leukemia cell line (medium coverage) | ||||||||||||
| K562 | 9993 | 0.96 | 4990 | 9972 | 0.82 | 3960 | 6770 | 2583 | 0.70 | |||
| GM12878 (high coverage) | ||||||||||||
| Cytosol | 28,706 | 2.61 | 18,724 | 28,699 | 2.62 | 18,441 | 14,451 | 11,846 | 2.59 | |||
| Nucleus | 31,420 | 2.23 | 21,249 | 31,418 | 2.23 | 21,208 | 17,377 | 13,137 | 2.19 | |||
| Whole | 30,520 | 1.91 | 18,960 | 30,520 | 1.89 | 18,960 | 15,420 | 10,932 | 1.94 | |||
| NA12878 exome data (low coverage) with RNA-seq data | ||||||||||||
| GIAB | 181,442 | 1.26 | 16,506 | 180,054 | 1.03 | 16,244 | 6188 | 1.55 | 5272 | |||
| Cytosol | 205,184 | 1.44 | 37,036 | 203,873 | 1.29 | 36,790 | 31,961 | 18,348 | 1.23 | |||
| Nucleus | 211,743 | 1.37 | 46,259 | 210,621 | 1.25 | 44,854 | 66,044 | 23,480 | 1.23 | |||
| Whole | 209,252 | 1.31 | 37,773 | 208,060 | 1.15 | 37,375 | 54,475 | 17,039 | 1.15 | |||
For each phaser and dataset, the cells report the number of SNPs phased, the switch error (SE) rate, and the total length of phased blocks (span) in kilobases. Bold values represent the best overall results for a metric in the dataset. Overall, HapTree-X consistently phases more SNPs with comparable or lower switch error rates and longer phased blocks.
N/A: the tool was not able to successfully complete the phasing.
(a) phASER, as a rule, uses more stringent filtering and thus achieves a lower switch error rate while phasing an order of magnitude fewer SNPs than the other tools; however, HapTree-X's SE rates are comparable if we restrict it to the same phasing blocks.
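The switch error rate reported in the table can be illustrated with a minimal Python sketch. It uses a common definition (the fraction of adjacent phased SNP pairs whose relative phase disagrees with a reference haplotype) and expresses it as a percentage; both the exact definition and the units are assumptions here, and the function name `switch_error_rate` is hypothetical rather than part of any of the tools compared.

```python
def switch_error_rate(predicted, truth):
    """Percentage of adjacent SNP pairs whose relative phase differs from the truth.

    `predicted` and `truth` are 0/1 allele lists for one haplotype of the same
    phased block (assumed encoding: 1 = mutant allele, 0 = reference allele).
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same SNPs")
    if len(predicted) < 2:
        return 0.0  # no adjacent pair, so no switch can occur
    # agree[i] is True where the predicted allele matches the truth at SNP i.
    agree = [p == t for p, t in zip(predicted, truth)]
    # A switch is a change between "matching" and "flipped" at adjacent SNPs.
    switches = sum(1 for a, b in zip(agree, agree[1:]) if a != b)
    return 100.0 * switches / (len(predicted) - 1)


truth = [0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 0, 1]  # phase flips once, after the second SNP
print(switch_error_rate(predicted, truth))  # 20.0
```

Under this definition a block that is entirely flipped relative to the truth has zero switch errors, since only the relative phase between neighbouring SNPs is scored.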
Comparison of HapCUT2 and HapTree-X (single-threaded mode) on WGS and 10X Genomics datasets.
| HapCUT2 | HapTree-X | |||
| NA12878 whole-genome sequencing (WGS) | 1:38:38 | (16.21) | ( | |
| NA12878 WGS with nucleus RNA | 1:48:01 | (16.41) | ( | |
| 10X Genomics NA12878 | 22:07:05 | (1.11) | ( | |
| 10X Genomics NA24385 | 22:13:43 | (4.83) | ( | |
Cells contain runtime and switch error rate (in parentheses). Bold values represent the best overall results for a metric in the dataset. HapTree-X is 3 to 10× faster than HapCUT2 while providing better or comparable switch error rates. Times are given as h:mm:ss.
Comparison of runtime between different phasing tools (in the format (h):mm:ss) on a few representative samples (all other samples display similar ratios between runtimes).
| HapCUT | HapCUT2 | phASER | HapTree-X | HapTree-X (4 threads) | |
|---|---|---|---|---|---|
| GIAB (NA24149) | 3:03 | 1:30 | 1:08 | 0:27 | |
| GM12878 (Nucleus) | 21:10 | 12:59 | 16:55 | 3:15 | |
| Exome (Whole) | 31:21 | 17:13 | 35:03 | 5:10 | |
| Exome (Cytosol) | 25:46 | 12:57 | 24:23 | 3:36 | |
| WGS (Nucleus) | N/A | 1:48:01 | N/A | 23:16 | |
| 10X (NA12878) | N/A | 22:07:05 | N/A | 57:37 |
Bold values represent the fastest runtime in single-threaded mode on a dataset. HapTree-X is clearly the fastest phaser, being up to 10× faster than the fastest competitor. N/A indicates that the tool was not evaluated on that sample.
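Runtime ratios like those quoted above can be recomputed from the (h):mm:ss entries. The following Python sketch shows one way to do so; the helper names `to_seconds` and `speedup` are hypothetical, and the example uses only the HapCUT and HapCUT2 entries for GIAB (NA24149) from the table.

```python
def to_seconds(stamp):
    """Convert a runtime written as m:ss or h:mm:ss into seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(slower, faster):
    """Ratio of two runtimes given as (h):mm:ss strings."""
    return to_seconds(slower) / to_seconds(faster)


print(to_seconds("22:07:05"))             # 79625
print(round(speedup("3:03", "1:30"), 2))  # 2.03 (HapCUT vs HapCUT2, GIAB NA24149)
```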
Fig. 2 Phasing of nine disease-associated genes by HapTree-X, HapCUT2, and phASER using whole-cell RNA-seq data from GM12878.
Unphased SNPs are represented by an empty circle, and each phased block is given a unique color. Note that some blocks might overlap because not all SNPs from a gene exhibit DASE. Reported SNP loci are relative to the human genome hg19 (GRCh37).
Fig. 3 Phasing of the BCR gene by HapTree-X, HapCUT2, and phASER on a selection of four GEUVADIS RNA-seq samples.
Unphased SNPs are represented by an empty circle, and each phased block is given a unique color. Reported SNP loci are relative to the human genome hg19 (GRCh37).
A toy phasing example on five SNPs: the counts of mutant/reference allele observations for each SNP (left) and the inferred haplotypes (right), assuming that the differential haplotype expression was β = 0.9.
| Allele/SNP | 1 | 2 | 3 | 4 | 5 | | Allele/SNP | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference | 12 | 15 | 79 | 97 | 11 | ⟶ | Reference | 0 | 0 | 1 | 1 | 0 |
| Mutant | 92 | 85 | 7 | 4 | 84 | | Mutant | 1 | 1 | 0 | 0 | 1 |
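The toy example can be reproduced with a short Python sketch. It is a simplification, not HapTree-X's actual implementation: it assumes β is known and constant across the gene, that reads are independent, that there are no sequencing errors, and that each SNP is decided on its own. Under those assumptions each read at a SNP comes from the higher-expressed haplotype with probability β, so the log-likelihood ratio for "mutant allele on the higher-expressed haplotype" versus "reference allele on it" is (mutant − reference) · log(β / (1 − β)). The function name `phase_by_dase` is hypothetical.

```python
import math


def phase_by_dase(ref_counts, alt_counts, beta=0.9):
    """Decide, per SNP, which allele lies on the higher-expressed haplotype.

    Returns (haplotype, llrs): haplotype[i] is 1 if the mutant allele is
    assigned to the higher-expressed haplotype and 0 otherwise; llrs[i] is
    the log-likelihood ratio supporting that call (positive favours mutant).
    """
    log_odds_per_read = math.log(beta) - math.log(1.0 - beta)
    haplotype, llrs = [], []
    for ref, alt in zip(ref_counts, alt_counts):
        # Binomial coefficients cancel, leaving only the count difference.
        llr = (alt - ref) * log_odds_per_read
        haplotype.append(1 if llr > 0 else 0)
        llrs.append(llr)
    return haplotype, llrs


reference = [12, 15, 79, 97, 11]
mutant = [92, 85, 7, 4, 84]

high, llrs = phase_by_dase(reference, mutant, beta=0.9)
low = [1 - allele for allele in high]  # diploid: the complementary haplotype
print(high)  # [1, 1, 0, 0, 1]
print(low)   # [0, 0, 1, 1, 0]
```

The two printed haplotypes match the two rows on the right-hand side of the table. The magnitude of each log-likelihood ratio grows with both coverage and the distance of β from 0.5, consistent with the caption's point that accuracy depends on the DHE and the depth of coverage.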