Abstract
In genome-wide association studies, results have been improved through imputation of a denser marker set based on reference haplotypes and phasing of the genotype data. To better handle very large sets of reference haplotypes, pre-phasing with only study individuals has been suggested. We describe a potential problem that is aggravated when pre-phasing strategies are used, and suggest a modification that avoids the resulting issues. We apply it to the MaCH tool, although the underlying problem is not specific to that tool. We evaluate the effectiveness of our remedy on a subset of HapMap data, comparing the original version of MaCH with our modified approach. Improvements are demonstrated on the original data (phase switch error rate decreasing by 10%), but the differences are more pronounced when the data is augmented to represent the presence of closely related individuals, especially siblings (30% reduction in switch error rate in the presence of children, 47% reduction in the presence of siblings). The main conclusion of this investigation is that existing statistical methods for phasing and imputation of unrelated individuals might give results of sub-par quality if a subset of study individuals are nonetheless related. As the populations collected for general genome-wide association studies grow in size, including relatives might become more common. If a general GWAS framework for unrelated individuals is employed on datasets with some related individuals, such as those including familial data or material from domesticated animals, caution is also warranted regarding the quality of haplotypes. Our modification to MaCH is available on request and straightforward to implement. We hope that this mode, if found to be of use, could be integrated as an option in future standard distributions of MaCH.
Year: 2013 PMID: 23555958 PMCID: PMC3610665 DOI: 10.1371/journal.pone.0060354
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Bad MCMC mixing for cases of double genotype sharing.
MaCH and similar approaches implement a Markov-chain Monte Carlo scheme where in each iteration the individual genotype resolutions are updated one by one, by mapping the genotypes. If two individuals contain identical marker genotypes for a longer stretch of markers, the Hidden Markov Model will give the other individual a probability approaching 1. When no reference haplotypes are provided, all haplotype data is initialized randomly. In this series of panels, individuals A and B are initialized differently (a). In panel (b), A is updated. With high probability, the existing (random) haplotype resolution from B is copied. When B is updated (c), A is sampled with high probability, replicating the original random data for B. In iteration 2, A is updated again (d), but again B is sampled with high probability. Since any haplotype resolution for A will match the genotypes for B, there is no pressure to identify a better resolution. The two individuals form a local feedback loop with no true mixing in the Markov chain. Our modified algorithm lowers the probability of sampling from a mirror individual (as in the pair of A and B), thus allowing haplotypes from other individuals in the dataset to influence the final resolution. Similar cases can also arise with groups of more than two individuals. Those are handled successfully by our remedy as well.
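The feedback loop described in this caption can be illustrated with a deliberately crude toy model. This is a sketch under strong assumptions and is not MaCH's Hidden Markov Model: each update copies a whole phase vector, the hypothetical parameter `p_mirror` stands in for the probability of sampling the mirror individual, and the `truth` vector stands in for the phase information carried by the rest of the dataset. The function name `run_toy_chain` is likewise illustrative.

```python
import random


def run_toy_chain(n_markers, p_mirror, n_sweeps, seed=0):
    """Toy model of two individuals, A and B, with identical genotypes.

    ASSUMPTION: whole-segment copying replaces the real HMM, and `truth`
    is a stand-in for what the other individuals in the dataset would
    contribute. None of this is taken from the MaCH source code.
    """
    rng = random.Random(seed)
    truth = [rng.randint(0, 1) for _ in range(n_markers)]
    # Without reference haplotypes, both resolutions start out random.
    phase = {
        "A": [rng.randint(0, 1) for _ in range(n_markers)],
        "B": [rng.randint(0, 1) for _ in range(n_markers)],
    }
    initial_b = list(phase["B"])

    for _ in range(n_sweeps):
        # Individuals are updated one by one within each iteration.
        for target, mirror in (("A", "B"), ("B", "A")):
            if rng.random() < p_mirror:
                # Copy the mirror individual's current (random) resolution.
                phase[target] = list(phase[mirror])
            else:
                # Let the rest of the dataset inform the resolution.
                phase[target] = list(truth)
    return truth, initial_b, phase


# With p_mirror = 1.0 the pair only ever replicates B's random start.
truth, initial_b, stuck = run_toy_chain(20, p_mirror=1.0, n_sweeps=50)
# With a lowered p_mirror, other individuals can influence the result.
_, _, mixed = run_toy_chain(20, p_mirror=0.0, n_sweeps=1)
```

In the first call, A copies B and B then copies A, so both end up holding B's initial random resolution no matter how many sweeps are run. Lowering `p_mirror` is the toy analogue of the modification described in the caption.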
Comparison between original and modified MaCH.
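| Dataset | Original MaCH: switch errors | Original MaCH: incorrectly imputed alleles | Modified MaCH: switch errors | Modified MaCH: incorrectly imputed alleles |
|---|---|---|---|---|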
| Trio parents (no children) | 5408 | 3730 | 4915 | 3566 |
| Trio parents and children | 1907 | 3261 | 1350 | 3217 |
| Full siblings to parents | 9657 | 4611 | 5096 | 3616 |
| Monozygotic twins to parents | 42074 | 8787 | 6309 | 4016 |
Comparison between original MaCH and a modified version with our remedy, showing both the total number of switch errors and the number of incorrectly imputed alleles. The comparison is based on the first 30 phased HapMap3 release 2 CEU trio parents [13]. Four versions of the dataset are used: (1) the original dataset (parents only), (2) the parents together with their children, (3) the parents with simulated siblings of the parents added, and (4) the parents with simulated twins of the parents added. When children are excluded and no virtual siblings are present, no known relationships exist between the individuals in the dataset. Imputation performance was verified by reconstructing the half of the marker set () that was left out, using minimac [11], employing the remainder of the phased CEU trio data (57 individuals) as reference panel. All MaCH runs were executed for iterations, with rounds for minimac. Metrics are reported for only the original individuals, in order to aid comparisons.
In this case, the minimac run starting from the recombination frequencies determined by the original MaCH failed to converge at all, with errors for all markers. The results for original MaCH in this table row are based on the pre-phased haplotypes from original MaCH, but starting out with the recombination frequencies from the modified version, in order to allow the minimac imputation to complete at all.
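The relative reductions quoted in the abstract can be checked against the table with a few lines of arithmetic. The sketch below assumes that the first and third numeric columns are switch errors and the second and fourth are incorrectly imputed alleles; this assignment is inferred from the caption and from the abstract's percentages, not from surviving column headers. The names `TABLE` and `relative_reduction` are illustrative.

```python
# Values copied from the table. ASSUMED column order per row:
# (original switch errors, original imputation errors,
#  modified switch errors, modified imputation errors)
TABLE = {
    "Trio parents (no children)": (5408, 3730, 4915, 3566),
    "Trio parents and children": (1907, 3261, 1350, 3217),
    "Full siblings to parents": (9657, 4611, 5096, 3616),
    "Monozygotic twins to parents": (42074, 8787, 6309, 4016),
}


def relative_reduction(original, modified):
    """Fraction by which `modified` is lower than `original`."""
    return (original - modified) / original


for name, (sw_orig, imp_orig, sw_mod, imp_mod) in TABLE.items():
    print(
        f"{name}: switch errors reduced by "
        f"{relative_reduction(sw_orig, sw_mod):.1%}, "
        f"incorrectly imputed alleles reduced by "
        f"{relative_reduction(imp_orig, imp_mod):.1%}"
    )
```

Under this column assignment, the switch error reductions come out at roughly 9%, 29%, and 47% for the first three rows, consistent with the 10%, 30%, and 47% stated in the abstract.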
Figure 2. Comparison of switch error locations.
Switch errors for all markers for CEU trio parents on chromosome 21 plotted in order left-right, top-down ( markers). For each marker, red color intensity indicates the switch error rate for all 30 parents using the original MaCH 1.0.17 algorithm, while green intensity indicates the error rate using our proposed modification. Hence, yellow color indicates regions where errors are shared. The issue of bad chain mixing we describe for the original algorithm manifests as contiguous (horizontal) blocks of repeated switch errors using the original approach, while the error rate using the modified algorithm is 50% lower in total. The errors in the modified algorithm consist of events more evenly distributed. Several of those error locations coincide with errors from the original method. This figure also shows that even if overall haplotype quality in terms of error rate would be acceptable, some regions can still be heavily affected, and paradoxically those regions are the ones where multiple individuals share both haplotypes identical by descent.
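For readers unfamiliar with the metric, a switch error is commonly counted as a change in which true haplotype an inferred haplotype follows between consecutive heterozygous markers. The sketch below uses that common definition; the exact convention used for the figures and table here is not stated in the text, so this is an assumption, and the function name `count_switch_errors` is illustrative.

```python
def count_switch_errors(true_pair, inferred_pair):
    """Count phase switches between a true and an inferred haplotype pair.

    Each pair is two equal-length allele sequences. Only heterozygous
    markers carry phase information, so homozygous markers are skipped.
    ASSUMPTION: this is the common textbook definition, not necessarily
    the exact one used in the study.
    """
    t1, t2 = true_pair
    h1, h2 = inferred_pair
    orientations = []
    for a, b, x, y in zip(t1, t2, h1, h2):
        if a == b:
            continue  # homozygous marker: phase is undefined
        if {a, b} != {x, y}:
            raise ValueError("inferred genotype differs from true genotype")
        # 0 if the first inferred haplotype follows t1 here, 1 if it follows t2
        orientations.append(0 if x == a else 1)
    # A switch error is any change of orientation between neighbouring
    # heterozygous markers.
    return sum(
        1 for prev, cur in zip(orientations, orientations[1:]) if prev != cur
    )


true_pair = ([0, 0, 1, 1, 0], [1, 1, 0, 0, 0])
inferred_pair = ([0, 0, 0, 0, 0], [1, 1, 1, 1, 0])
print(count_switch_errors(true_pair, inferred_pair))  # prints 1
```

Swapping the two inferred haplotypes as a whole gives zero switch errors under this definition, since only changes of orientation along the chromosome are counted.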