Gustavo Glusman, Hannah C Cox, Jared C Roach.
Abstract
Genomic information reported as haplotypes rather than genotypes will be increasingly important for personalized medicine. Current technologies generate diploid sequence data that is rarely resolved into its constituent haplotypes. Furthermore, paradigms for thinking about genomic information are based on interpreting genotypes rather than haplotypes. Nevertheless, haplotypes have historically been useful in contexts ranging from population genetics to disease-gene mapping efforts. The main approaches for phasing genomic sequence data are molecular haplotyping, genetic haplotyping, and population-based inference. Long-read sequencing technologies are enabling longer molecular haplotypes, and decreases in the cost of whole-genome sequencing are enabling the sequencing of whole-chromosome genetic haplotypes. Hybrid approaches combining high-throughput short-read assembly with strategic approaches that enable physical or virtual binning of reads into haplotypes are enabling multi-gene haplotypes to be generated from single individuals. These techniques can be further combined with genetic and population approaches. Here, we review advances in whole-genome haplotyping approaches and discuss the importance of haplotypes for genomic medicine. Clinical applications include diagnosis by recognition of compound heterozygosity and by phasing regulatory variation to coding variation. Haplotypes, which are more specific than less complex variants such as single nucleotide variants, also have applications in prognostics and diagnostics, in the analysis of tumors, and in typing tissue for transplantation. Future advances will include technological innovations, the application of standard metrics for evaluating haplotype quality, and the development of databases that link haplotypes to disease.
Year: 2014 PMID: 25473435 PMCID: PMC4254418 DOI: 10.1186/s13073-014-0073-7
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Overview of the three methods for phasing whole-genome sequence data. Phasing is achieved by (1) molecular methods, (2) genetic analysis, or (3) population inference. Molecular methods focus on individual samples and involve either (a) processing genomic DNA prior to sequencing or (b) leveraging the single-molecule characteristic of each physical read. Genetic analysis and population inference process sequenced genomes from pedigrees and population cohorts, respectively.
Figure 2. The properties of phasing methods. The level of confidence in phasing and the achievable range of phased sequence length both vary depending on the method used. Molecular methods provide direct observations from single molecules and therefore the level of confidence in the results is high. The phased sequence length that can be achieved by these methods has a wide range, which depends on the method employed. Molecular observations can be assembled into haplotypes (dashed arrows), adding moderately to the range of phased sequence length, and potentially introducing inference error. Genetic analyses infer phase by leveraging the property of Mendelian segregation and can phase entire chromosomes. Population inference methods are probabilistic and limited to the generation of short-range haplotype blocks.
Overview of whole-genome haplotyping methods
| Method | Sample | Advantages | Disadvantages* |
|---|---|---|---|
| Molecular: single and paired-end physical reads | Individual | Haplotype is directly observed from sequence data; Simple; Can resolve private and rare haplotypes; Can phase | Produces short haplotypes, even after assembly |
| Molecular: chromosome sorting, clone-by-clone, dilution, proximity ligation | Individual | Haplotype is directly observed from sequence data; Highly accurate; Can resolve private and rare haplotypes; Can phase; Can resolve long-range and chromosome-length haplotypes (depending on method); Ideal for generating personalized genome-resolved haplotypes | May be labor intensive, time-consuming and expensive, therefore difficult to translate to large sample sizes |
| Molecular: haplotype assembly | Individual | Leverages molecular haplotype information from WGS data and/or from sorted chromosomes, clones; Works well when molecular haplotypes are long (that is, from cosmid or BAC) | Assembly requires variants in overlapping sequence reads; Limited by the accuracy and availability of suitable reference data; Generates short-range haplotypes; May introduce phase errors |
| Genetic analysis | Trios, nuclear families | Can accurately phase high-throughput short-read sequencing reads; Low error rate; Precisely maps recombinations and inheritance states; Enables detection of sequencing errors; Can phase private and rare alleles; Can phase entire chromosomes; Suitable for clinical applications | Cannot resolve sites where all family members are heterozygous; May not be possible to ascertain family members |
| Population inference | Unrelated individuals, duos, trios | Cost-effective; Facilitates haplotype imputation in samples with low-density microarray panels; Useful when family members cannot be ascertained; Large sample sizes increase accuracy; Good for large samples of unrelated individuals; Incorporation of family duos and trios improves accuracy | Can only phase common variants; Difficult to impute private variants or rare haplotypes; Limited by the accuracy and availability of suitable reference data; Generates short-range haplotypes; Sample size impacts haplotype frequency estimations; Methods are probabilistic and accuracy must be balanced against computational costs |
*All of these methods are limited by the accuracy of the sequence data.
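
The genetic-analysis entry above (phasing by Mendelian segregation, with failure at sites where all family members are heterozygous) can be illustrated with a minimal sketch for a single biallelic site in a trio. The function name `phase_child` and the genotype encoding (a pair of 0/1 alleles) are assumptions made for this illustration and do not correspond to any specific tool.

```python
from typing import Optional, Tuple

# Unordered pair of alleles at one site: 0 = reference, 1 = alternate.
Genotype = Tuple[int, int]


def phase_child(child: Genotype, mother: Genotype,
                father: Genotype) -> Optional[Tuple[int, int]]:
    """Return (maternal_allele, paternal_allele) for the child.

    Returns None when the site cannot be phased from the trio alone.
    """
    a, b = child
    if a == b:
        return (a, b)  # homozygous child: both orderings are identical
    # Try both ways of assigning the child's two alleles to the parents
    # and keep those consistent with Mendelian inheritance.
    candidates = set()
    for mat, pat in ((a, b), (b, a)):
        if mat in mother and pat in father:
            candidates.add((mat, pat))
    if len(candidates) == 1:
        return candidates.pop()
    # Two candidates: ambiguous (for example, all three heterozygous).
    # Zero candidates: Mendelian inconsistency, possibly a sequencing error.
    return None
```

For example, `phase_child((0, 1), (0, 0), (0, 1))` assigns the alternate allele to the paternal haplotype, whereas a site that is heterozygous in all three family members returns `None`.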
Summary of selected software available for whole-genome haplotyping
| Approach | Algorithm description |
|---|---|
| Molecular - haplotype assembly | A combinatorial approach implementing a max-cut-based algorithm and optimized minimum error correction (MEC) solution |
| Molecular - haplotype assembly | A collection of algorithms including |
| Molecular - haplotype assembly | Heuristic algorithm for optimizing a combination of the MEC and Maximum Fragments Cut models |
| Molecular - haplotype assembly | Probabilistic mixture model |
| Molecular - haplotype assembly | Markov chain Monte Carlo algorithm |
| Genetic analysis | Implements a parsimony approach to generate inheritance state vectors and a hidden Markov model to deduce haplotypes |
| Population inference | Phased input data are used to build a local haplotype cluster model, which is sampled using a hidden Markov model. Iterations and the Viterbi algorithm are used to select the ‘most likely’ haplotype |
| Population inference | Enhancement of |
| Population inference | Implements a hashing-algorithm approach to identifying whole-haplotype segment sharing |
| Population inference | Pre-phasing, imputation and haplotype sampling strategy incorporating a Monte Carlo algorithm and Markov model calculations |
| Population inference | Implements a Markov Chain algorithm for genotype imputation and haplotyping |
| Population inference | Implements Bayesian haplotype reconstruction |
| Population inference | Implements hidden Markov model sampling |
| Population inference | A population imputation pipeline that generates genotype likelihoods using a binary sequence map-specific binomial mixture model. Haplotypes are then sampled using a hidden Markov model |
| Population inference | Scalable sliding windows are used to optimize haplotypes and a parsimony approach iteratively restricts the number of solutions |
| Combination strategies | Sampling within a probabilistic model combining read data with a reference panel of haplotypes. Successor to |
| Combination strategies | Adds short-read molecular information to population inference |
| Combination strategies | Combines haplotype assembly and population inference |
| Combination strategies | Implements a phylogeny model to estimate haplotype frequencies recursively using the expectation maximization algorithm |
| Combination strategies | Integrates physical, genetic and population phasing |
Abbreviations: OSS, open source software; MEC, minimum error correction.
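
Several of the haplotype assembly entries above optimize the minimum error correction (MEC) objective. As a rough illustration of what that objective measures, the sketch below scores a candidate haplotype against a set of read fragments and, for very small inputs only, searches all haplotypes exhaustively. The names `mec_score` and `brute_force_mec` and the fragment encoding (a dictionary from heterozygous site index to observed allele) are assumptions for this example; practical tools rely on heuristics or probabilistic models rather than exhaustive search.

```python
from itertools import product
from typing import Dict, List, Tuple

# A fragment maps heterozygous site index -> observed allele (0 or 1).
Fragment = Dict[int, int]


def mec_score(haplotype: List[int], fragments: List[Fragment]) -> int:
    """Minimum number of allele calls to flip so every fragment matches
    either the haplotype or its complement."""
    total = 0
    for frag in fragments:
        mism_h1 = sum(1 for site, allele in frag.items()
                      if haplotype[site] != allele)
        # The complementary haplotype mismatches exactly where h1 matches.
        mism_h2 = len(frag) - mism_h1
        total += min(mism_h1, mism_h2)
    return total


def brute_force_mec(n_sites: int,
                    fragments: List[Fragment]) -> Tuple[int, List[int]]:
    """Exhaustive search over haplotypes; exponential, toy inputs only."""
    best_score, best_hap = None, None
    # Fix the first allele to 0: a haplotype and its complement score the same.
    for rest in product((0, 1), repeat=n_sites - 1):
        hap = [0, *rest]
        score = mec_score(hap, fragments)
        if best_score is None or score < best_score:
            best_score, best_hap = score, hap
    return best_score, best_hap
```

A nonzero optimum indicates that the fragments conflict with one another, for instance because of a sequencing error in one read.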
Figure 3. Switch errors and quality metrics. We present four scenarios involving low- and high-frequency switch errors and the resulting values for two quality metrics: total switch rate, and the pairwise metric 2 × cse − 1, which we term the ‘phase accuracy’ (Box 3). The latter was computed on the assumption that each local ‘back and forth’ switch error affects 1% of the markers in the chromosome. The true haplotypes (maternal (M) and paternal (P)) are shown at the top. In Example 1, there is a single phase-error switch at ‘a’; variants on opposite sides of the error are incorrectly phased (denoted by the ‘x’ on the curved arrow). This situation might arise if a parent’s haplotype is inferred by subtraction of a child’s haplotype, missing a meiotic recombination. Example 1 thus has a low-frequency phase error but no high-frequency errors. In Example 2, there are many high-frequency errors, but no low-frequency errors. Most variant pairs are separated by even numbers of switch errors, and are thus properly phased (curved arrows). This error pattern might arise from a long-read technology such as strobe sequencing. In Example 3, there are three switch errors, one at ‘a’ and two at ‘b’. This haplotype could have arisen from a false haplotype assembly join at ‘a’ and a sequence error of a single base at ‘b’. Example 3 thus has a mix of low- and high-frequency errors. In terms of the pairwise metric, the single switch in Example 1 most strongly affects the long-range haplotype quality, while the several localized switch errors (each pair affecting just one or a few markers) degrade haplotype quality only modestly.
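
The contrast drawn in this caption between the switch rate and a pairwise metric can be reproduced with a small sketch. It assumes that the pairwise metric can be read as 2c − 1, where c is the fraction of variant pairs that are phased correctly relative to each other; Box 3 is not reproduced here, so this reading, the function name `switch_and_pairwise`, and the 0/1 haplotype encoding are all assumptions made for illustration.

```python
from typing import List, Tuple


def switch_and_pairwise(truth: List[int],
                        inferred: List[int]) -> Tuple[float, float]:
    """Return (switch rate, 2c - 1) for one haplotype over heterozygous sites.

    truth[i] and inferred[i] are the alleles (0 or 1) placed on the same
    parental copy by the true and the inferred phasing.
    """
    if len(truth) != len(inferred) or len(truth) < 2:
        raise ValueError("need two equal-length haplotypes with >= 2 sites")
    # flipped[i] is 1 where the inferred allele sits on the wrong copy.
    flipped = [t ^ p for t, p in zip(truth, inferred)]
    n = len(flipped)
    # A switch error is a change in flipped state between adjacent sites.
    switches = sum(1 for i in range(1, n) if flipped[i] != flipped[i - 1])
    switch_rate = switches / (n - 1)
    # A pair is correctly phased when both sites are flipped or neither is.
    k = sum(flipped)
    correct_pairs = k * (k - 1) // 2 + (n - k) * (n - k - 1) // 2
    total_pairs = n * (n - 1) // 2
    c = correct_pairs / total_pairs
    return switch_rate, 2 * c - 1
```

Under these assumptions, one switch in the middle of ten sites gives a low switch rate but a pairwise value near zero, while a single flipped marker counts as two switches yet leaves most pairs correctly phased, which mirrors the behavior described for Examples 1 and 3.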