Volodymyr Kuleshov, Dan Xie, Rui Chen, Dmitry Pushkarev, Zhihai Ma, Tim Blauwkamp, Michael Kertesz, Michael Snyder.
Abstract
The rapid growth of sequencing technologies has greatly contributed to our understanding of human genetics. Yet, despite this growth, mainstream technologies have not been fully able to resolve the diploid nature of the human genome. Here we describe statistically aided, long-read haplotyping (SLRH), a rapid, accurate method that uses a statistical algorithm to take advantage of the partially phased information contained in long genomic fragments analyzed by short-read sequencing. For a human sample, as little as 30 Gbp of additional sequencing data are needed to phase genotypes identified by 50× coverage whole-genome sequencing. Using SLRH, we phase 99% of single-nucleotide variants in three human genomes into long haplotype blocks 0.2-1 Mbp in length. We apply our method to determine allele-specific methylation patterns in a human genome and identify hundreds of differentially methylated regions that were previously unknown. SLRH should facilitate population-scale haplotyping of human genomes.
Year: 2014 PMID: 24561555 PMCID: PMC4073643 DOI: 10.1038/nbt.2833
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Figure 1. Statistically aided long-read haplotyping. (a) Overview of the library preparation protocol. The subject's DNA (1) is sheared into fragments of about 10 kbp (2), which are then diluted and placed into 384 wells, at about 3,000 fragments per well (3). Within each well, fragments are amplified through long-range PCR, cut into short fragments and barcoded (4), before being finally pooled together and sequenced (5). (b) Overview of the bioinformatics pipeline. Sequenced short reads are aligned and mapped back to their original well using the barcode adapters (1). Within each well, reads are grouped into fragments (2), which are assembled at their overlapping heterozygous SNVs into haplotype blocks (3). These blocks are assigned a phase statistically based on a phased reference panel (4), which produces very long haplotype contigs (5).
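The numbers in panel (a) can be checked with a little arithmetic. The sketch below is only a back-of-envelope illustration: the fragment length, fragments per well and well count come from the caption, while `GENOME_BP` is an assumed round figure for the haploid human genome that the source does not state.

```python
# Back-of-envelope arithmetic for the dilution design described in Figure 1a.
FRAGMENT_BP = 10_000         # "about 10 kbp" per fragment (from the caption)
FRAGMENTS_PER_WELL = 3_000   # "about 3,000 fragments per well" (from the caption)
WELLS = 384                  # number of wells (from the caption)
GENOME_BP = 3_100_000_000    # ASSUMPTION: approximate haploid human genome size


def well_dna_bp():
    """Total DNA placed in one well, in base pairs."""
    return FRAGMENT_BP * FRAGMENTS_PER_WELL


def well_genome_fraction():
    """DNA in one well as a fraction of one haploid genome."""
    return well_dna_bp() / GENOME_BP


def library_physical_coverage():
    """DNA across all wells, in haploid-genome equivalents."""
    return WELLS * well_dna_bp() / GENOME_BP


if __name__ == "__main__":
    print(f"DNA per well: {well_dna_bp() / 1e6:.0f} Mbp")
    print(f"Fraction of a haploid genome per well: {well_genome_fraction():.2%}")
    print(f"Physical coverage per library: {library_physical_coverage():.1f}x")
```

Under these assumptions each well holds about 30 Mbp, roughly 1% of a haploid genome, which is why two fragments that overlap within a well are unlikely to come from different parental copies.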
Summary of haplotyping performance. We used SLRH to phase three human genomes from the HapMap project. Two libraries were prepared for each subject, and each was evaluated at a fixed accuracy threshold.
| Haplotype block N50 length (bp) | Phasing rate over SNVs | Switches per Mbp |
|---|---|---|
| 563,801 | 99.00% | 0.47 |
| 647,599 | 99.25% | 0.68 |
| 531,804 | 98.84% | 0.75 |
| 401,342 | 98.49% | 0.51 |
| 405,472 | 98.44% | 0.49 |
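The column metrics above can be illustrated with a minimal sketch. The source does not say exactly how the authors computed them, so the code assumes the usual definitions: N50 is the block length at which blocks of that size or longer cover half the total phased length, and a switch is a change in which parental haplotype a predicted block follows between consecutive heterozygous sites. All function names are hypothetical.

```python
def n50(lengths):
    """Return the length L such that blocks of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def count_switches(predicted, truth):
    """Count switch errors inside one block.

    `predicted` and `truth` hold the allele (0 or 1) assigned to the same
    haplotype at each heterozygous site. A block that is entirely flipped
    relative to the truth has no switches.
    """
    agree = [p == t for p, t in zip(predicted, truth)]
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def switches_per_mbp(n_switches, phased_bp):
    """Normalise a switch count by the phased length in Mbp."""
    return n_switches / (phased_bp / 1_000_000)
```

For example, `count_switches([0, 0, 1, 1, 0], [0, 0, 0, 0, 0])` counts two switches: one where the prediction leaves the true haplotype and one where it returns.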
Figure 2. Haplotyping results at several accuracy thresholds. Long statistically constructed haplotype contigs are cut at positions where confidence scores are below a certain threshold (x axes), forming shorter but more accurate haplotype blocks. We evaluate the completeness (top panels) and the switch accuracy (bottom panels) of the smaller blocks at a series of thresholds. The blocks are evaluated only over SNVs.
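The cutting step described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes one confidence score per junction between consecutive SNVs, and the names `cut_contig`, `snv_positions` and `junction_scores` are hypothetical.

```python
def cut_contig(snv_positions, junction_scores, threshold):
    """Split one haplotype contig into blocks at low-confidence junctions.

    junction_scores[i] is the (assumed) confidence that SNV i and SNV i+1
    are phased correctly relative to each other. A junction scoring below
    `threshold` ends the current block.
    """
    if len(junction_scores) != len(snv_positions) - 1:
        raise ValueError("need exactly one score per pair of adjacent SNVs")
    blocks = []
    current = [snv_positions[0]]
    for position, score in zip(snv_positions[1:], junction_scores):
        if score < threshold:
            blocks.append(current)   # close the block at the weak junction
            current = [position]
        else:
            current.append(position)
    blocks.append(current)
    return blocks
```

Raising the threshold produces more, shorter blocks; lowering it keeps the contig intact. This is the trade-off between block length and switch accuracy that the panels evaluate.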
Overview of heterozygosity patterns. Within each subject, SLRH phases about 90% of all genes. Of the genes containing variants, the majority (85%) contain heterozygous variants and a very large fraction (74%) contain compound heterozygous variants. Moreover, the phased genes harbor about 2,500 SNVs that were found to be damaging by the SIFT software package. About 1,500 genes are affected by such variants, and about 500 have both of their copies damaged.
| NA12878 | NA12891 | NA12892 |
|---|---|---|
| 23,410 | 23,410 | 23,410 |
| 21,018 | 20,804 | 20,711 |
| 14,799 | 14,630 | 14,571 |
| 12,634 | 12,573 | 12,339 |
| 11,076 | 10,970 | 10,790 |
| 2,460 | 2,422 | 2,323 |
| 1,597 | 1,667 | 1,583 |
| 1,573 | 1,579 | 1,507 |
| 518 | 481 | 466 |
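Phase information is what makes the per-gene categories in this table computable. The sketch below shows one way to classify a single gene from its phased variants. The source gives no operational definitions, so the code assumes that a gene is compound heterozygous when heterozygous alternate alleles sit on both parental copies, and that a copy is damaged when it carries at least one variant flagged as damaging. The tuple layout and the name `classify_gene` are hypothetical.

```python
def classify_gene(variants):
    """Classify one gene from its phased variants.

    Each variant is a tuple (hap0_allele, hap1_allele, damaging), where the
    alleles are 0 (reference) or 1 (alternate) on the two parental copies and
    `damaging` is a boolean flag, for example from a predictor such as SIFT.
    """
    het = [(a, b) for a, b, _ in variants if a != b]
    alt_on_hap0 = any(a == 1 for a, _ in het)
    alt_on_hap1 = any(b == 1 for _, b in het)
    damaged_hap0 = any(d and a == 1 for a, _, d in variants)
    damaged_hap1 = any(d and b == 1 for _, b, d in variants)
    return {
        "heterozygous": bool(het),
        "compound_heterozygous": alt_on_hap0 and alt_on_hap1,
        "both_copies_damaged": damaged_hap0 and damaged_hap1,
    }
```

Two damaging variants in the same gene give `both_copies_damaged` only when they lie on opposite copies (or when a variant is homozygous), which unphased genotypes alone cannot distinguish.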
Figure 3. Haplotyping performance from 30 Gbp of sequencing. We ran the bioinformatics pipeline independently on two 30 Gbp replicate libraries of the sample NA12878. The resulting haplotype blocks are almost as accurate and only 100 kbp shorter than ones derived from two phasing libraries. Moreover, results from the two replicates are highly concordant.
Figure 4. Genome browser view of differentially methylated regions at the promoter of the H19 gene. Differences in DNA methylation levels (green tracks, D) and the absolute DNA methylation level at the two parental alleles (blue tracks for paternal methylation (P) and red tracks for maternal methylation (M)) are shown around the H19 locus. The shaded regions show a significant (P<0.05; Fisher's exact test) difference in DNA methylation levels between the two parental alleles and are identified as DMRs.
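The significance test named in this caption can be sketched for a single region. The caption gives only the test and the P<0.05 cutoff. The 2×2 layout used here (methylated and unmethylated read counts on the paternal and maternal alleles) is an assumption, as are the function names. The two-sided p-value is computed from the hypergeometric distribution with the standard library only; `scipy.stats.fisher_exact` would be the usual choice in practice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Assumed layout: row 1 = paternal allele, row 2 = maternal allele;
    column 1 = methylated reads, column 2 = unmethylated reads.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def is_differentially_methylated(a, b, c, d, alpha=0.05):
    """Apply the P<alpha cutoff to one region's read counts."""
    return fisher_exact_two_sided(a, b, c, d) < alpha
```

With 8 methylated and 2 unmethylated reads on one allele against 1 and 5 on the other, the p-value is 400/11440, about 0.035, so the region would pass the 0.05 cutoff.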