Abstract
MOTIVATION: Reconstruction of haplotypes for human genomes is an important problem in medical and population genetics. Hi-C sequencing generates read pairs with long-range haplotype information that can be computationally assembled to generate chromosome-spanning haplotypes. However, the haplotypes have limited completeness and low accuracy. Haplotype information from population reference panels can potentially be used to improve the completeness and accuracy of Hi-C haplotyping.
Year: 2019 PMID: 31510646 PMCID: PMC6612846 DOI: 10.1093/bioinformatics/btz329
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Integrating haplotype information from Hi-C reads and population reference panels to improve accuracy and completeness of haplotyping. Haplotypes assembled using HapCUT2 from Hi-C reads have three unphased variants (2, 9 and 15) and an incorrectly phased variant (#6) with respect to the large haplotype block due to an erroneous Hi-C read (edge connecting variants 6 and 14). Haplotypes estimated using a population reference panel provide accurate short-range phase information. This information can be combined with the Hi-C reads to phase two of the three variants with no sequence information and also correct the phase for variant #6.
Comparison of the phasing completeness and accuracy on whole-genome Hi-C data for NA19240
| Method | SNVs phased (%) | Absolute error rate (%) | Switch error rate (%) | Mismatch rate (%) | Run time (min:s) |
|---|---|---|---|---|---|
| Reads only | 51.30 | 0.49 | 0.20 | 0.365 | 02:43 |
| Integrated phasing | 97.32 | 0.31 | 0.034 | 0.266 | 08:57 |
| SHAPEIT2 | 98.67 | 42.1 | 0.27 | 0.76 | 04:57 |
Note: Results shown are from the analysis of chromosome 20 only. Run time is reported as minutes:seconds.
Fig. 2. Completeness and accuracy of haplotyping using Hi-C data for NA12878 (all statistics are for chromosome 20 only). (A) Error rates for haplotypes estimated using HapCUT2 on the MboI and Arima Hi-C datasets, and the integrated phasing algorithm applied to the Arima Hi-C data. (B) Haplotyping completeness (percentage of SNVs phased) across the three different methods. (C) Distribution of read-depth across SNV sites using the Arima and MboI Hi-C datasets (36× coverage). (D) Haplotype completeness for Arima Hi-C data as a function of sequence coverage.
Phasing completeness and accuracy on whole-genome Strand-seq data for NA12878
| Method | SNVs phased (%) | Switch error rate (%) | Mismatch error rate (%) | Absolute error rate (%) |
|---|---|---|---|---|
| Reads only | 71.38 | 0.091 | 0.268 | 0.905 |
| Integrated phasing | 94.56 | 0.0364 | 0.134 | 0.868 |
Note: Results are shown for data on chromosome 20 only. Switch and mismatch error rates were calculated by comparison to Platinum Genomes haplotypes for NA12878.
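The record does not spell out how switch and mismatch errors are counted, so the following is only a minimal sketch under one common convention, not the authors' evaluation code. It assumes that both haplotypes are given as 0/1 alleles of one haplotype at the same heterozygous sites, that an isolated single-site flip (two adjacent phase changes) counts as one mismatch, and that any other phase change counts as one switch. The function names and the denominators (n - 1 for switches, n for mismatches) are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def phase_states(truth: Sequence[int], pred: Sequence[int]) -> List[int]:
    """Per-site state: 0 if the predicted allele agrees with the truth
    haplotype, 1 if it agrees with the complementary haplotype."""
    if len(truth) != len(pred):
        raise ValueError("haplotypes must cover the same sites")
    return [0 if t == p else 1 for t, p in zip(truth, pred)]


def count_errors(truth: Sequence[int], pred: Sequence[int]) -> Tuple[int, int]:
    """Return (switches, mismatches) under the assumed convention."""
    states = phase_states(truth, pred)
    # Indices i where the phase state changes between site i-1 and site i.
    flips = [i for i in range(1, len(states)) if states[i] != states[i - 1]]
    switches = 0
    mismatches = 0
    k = 0
    while k < len(flips):
        # Two adjacent phase changes bracket a single flipped site.
        if k + 1 < len(flips) and flips[k + 1] == flips[k] + 1:
            mismatches += 1
            k += 2
        else:
            switches += 1
            k += 1
    return switches, mismatches


def error_rates(truth: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Return (switch rate, mismatch rate) as percentages.

    Assumed denominators: n - 1 adjacent pairs for switches, n sites for
    mismatches. Requires at least two sites.
    """
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two phased sites")
    switches, mismatches = count_errors(truth, pred)
    return 100.0 * switches / (n - 1), 100.0 * mismatches / n


if __name__ == "__main__":
    truth = [0, 1, 0, 1, 0, 1, 0, 1]
    long_switch = [0, 1, 0, 1, 1, 0, 1, 0]   # phase inverted from site 4 onward
    single_flip = [0, 1, 1, 1, 0, 1, 0, 1]   # only site 2 is flipped
    print(count_errors(truth, long_switch))
    print(count_errors(truth, single_flip))
```

One simplification to be aware of: a flip at the very first or last site produces only one phase change, so this sketch counts it as a switch rather than a mismatch. Evaluation tools may treat block ends differently.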