John M Hickey, Brian P Kinghorn, Bruce Tier, James F Wilson, Neil Dunstan, Julius H J van der Werf.
Abstract
BACKGROUND: Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data.
Year: 2011 PMID: 21388557 PMCID: PMC3068938 DOI: 10.1186/1297-9686-43-12
Source DB: PubMed Journal: Genet Sel Evol ISSN: 0999-193X Impact factor: 4.297
Figure 1. Illustration of the long-range phasing process.
Figure 2. A core and its adjacent tails.
Phasing performance for real data sets
| Dataset | Nb individuals¹ | Nb SNP² | Core/CplusT length³ | M/E%⁴ | Time⁵ | % phased⁶ |
|---|---|---|---|---|---|---|
| Sheep chr. 4 | 1019 | 2278 | 100/300 | 1.00 | 3 min 39 s | 98.17 |
| Sheep chr. 5 | 1016 | 1927 | 100/400 | 1.00 | 5 min 1 s | 97.62 |
| Pig chr. 1 | 2723 | 3999 | 100/500 | 0.00 | 364 min | 96.87 |
| Beef chr. 24 | 2171 | 874 | 100/300 | 0.00 | 17 min 8 s | 98.42 |
| Dairy chr. 1 | 5057 | 2296 | 100/400 | 0.00 | 456 min | 97.99 |
| Human chr. 1 | 879 | 4472 | 100/300 | 1.00 | 3 min 29 s | 93.73 |
¹ Number of individuals in the dataset; ² Number of SNP to be phased; ³ Optimal core and CplusT length parameters; ⁴ Optimal missing genotype/genotype error % (M/E%) threshold parameter; ⁵ Computation time was measured on a 64-bit desktop with an Intel i7 3.07 GHz quad-core processor running Linux and includes the time required to parse and summarise the data and write out the results; ⁶ Percentage of alleles phased
Percentage of alleles correctly/incorrectly phased by the optimal settings¹ for the simulated data sets
| Dataset | Ne 100, with pedigree | Ne 100, without pedigree | Ne 1000, with pedigree | Ne 1000, without pedigree |
|---|---|---|---|---|
| Pedigree 1 | 99.11/0.30 | 99.11/0.30 | ||
| Pedigree 2 NPG² | 97.85/0.43 | 98.49/0.63 | 97.73/0.29 | 97.88/0.39 |
| Pedigree 2 PG³ | 98.85/0.49 | 99.03/0.42 | 99.70/0.17 | 99.48/0.17 |
| Pedigree 3 NPG² | 98.35/0.38 | 98.61/0.63 | 99.23/0.14 | 99.14/0.27 |
| Pedigree 3 PG³ | 99.23/0.37 | 99.05/0.41 | 99.76/0.16 | 99.58/0.13 |
| Pedigree 4 NPG² | 98.20/0.41 | 98.61/0.63 | 98.19/0.41 | 98.61/0.63 |
| Pedigree 4 PG³ | 99.35/0.31 | 99.29/0.32 | 99.74/0.20 | 99.59/0.15 |
| Pedigree 5 | 97.59/0.42 | 98.28/0.60 | 99.30/0.30 | 99.31/0.22 |
| Pedigree 6 sires | 97.05/0.45 | 98.40/0.62 | 99.05/0.17 | 99.25/0.20 |
| Pedigree 6 last 2000 | 98.24/0.39 | 98.24/0.39 | 99.34/0.20 | 99.42/0.26 |
| Pedigree 7 sires | 97.56/0.40 | 98.71/0.50 | 98.98/0.20 | 99.15/0.29 |
| Pedigree 7 last 2000 | 96.86/0.46 | 98.40/0.66 | 98.85/0.20 | 99.34/0.26 |
| Pedigree 8 | 95.01/1.10 | 96.67/1.39 | 96.02/0.57 | 96.36/1.01 |
¹ Optimal settings refer to the optimal core and CplusT lengths; core length was 100 SNP, CplusT length varied between 300 and 500 SNP; ² NPG indicates that this dataset did not have parents genotyped; ³ PG indicates that this dataset had parents genotyped
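Each cell in the table above packs two percentages into one string. The following is a minimal Python sketch for splitting such a cell, assuming that any alleles counted as neither correctly nor incorrectly phased were left unphased. The helper name `split_phasing_cell` is hypothetical and is not part of the authors' software.

```python
from typing import Tuple


def split_phasing_cell(cell: str) -> Tuple[float, float, float]:
    """Split a 'correct/incorrect' cell into (correct, incorrect, remainder)."""
    correct_text, incorrect_text = cell.strip().split("/")
    correct = float(correct_text)
    incorrect = float(incorrect_text)
    # Assumption: whatever is neither correct nor incorrect was left unphased.
    remainder = round(100.0 - correct - incorrect, 2)
    return correct, incorrect, remainder


# Pedigree 8, Ne 100, with pedigree (value copied from the table)
print(split_phasing_cell("95.01/1.10"))
```

Reading the cells this way makes it easier to compare datasets on the share of alleles that were left unresolved, rather than on the correct percentage alone.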
Figure 3. Effect of core and CplusT lengths on the percentage of alleles correctly phased for pedigree 1 Ne.
Figure 4. Effect of core and CplusT lengths on the percentage of alleles correctly phased for pedigree 1 Ne.
Figure 6. X-Y plot for percentage incorrectly phased and percentage correctly phased for all core and CplusT lengths tested on pedigree 1 Ne.
Computation time¹ required to phase 2000 SNP² for the simulated Ne 1000 data when using or ignoring pedigree information
| Dataset | Number of individuals | With pedigree | Without pedigree |
|---|---|---|---|
| Pedigree 1 | 2000 | 30 min 34 s | |
| Pedigree 2 NPG³ | 1000 | 4 min 3 s | 3 min 3 s |
| Pedigree 2 PG⁴ | 2000 | 39 min 21 s | 26 min 58 s |
| Pedigree 3 NPG³ | 1000 | 7 min 56 s | 5 min 20 s |
| Pedigree 3 PG⁴ | 1600 | 18 min 23 s | 17 min 11 s |
| Pedigree 4 NPG³ | 1000 | 9 min 30 s | 50 min 27 s |
| Pedigree 4 PG⁴ | 1510 | 17 min 4 s | 10 min 48 s |
| Pedigree 5 | 3000 | 106 min 37 s | 179 min 52 s |
| Pedigree 6 sires | 2578 | 78 min 22 s | 39 min 43 s |
| Pedigree 6 last 2000 | 2000 | 48 min 24 s | 321 min 27 s |
| Pedigree 7 sires | 1777 | 23 min 47 s | 22 min 18 s |
| Pedigree 7 last 2000 | 2000 | 54 min 1 s | 74 min 15 s |
| Pedigree 8 | 879 | 4 min 14 s | 2 min 41 s |
¹ A 64-bit desktop with an Intel i7 3.07 GHz quad-core processor running Linux was used to measure computation time; computation time, given in minutes and seconds, includes the time required to parse and summarise the data and write out the results; ² Using a core length of 100 SNP and a CplusT length of 300 SNP for each dataset; ³ NPG indicates that this dataset did not have parents genotyped; ⁴ PG indicates that this dataset had parents genotyped
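The computation times above are reported as text, which makes the with-pedigree and without-pedigree columns awkward to compare directly. The following is a minimal Python sketch that converts such strings to seconds. The function name `to_seconds` is hypothetical, and the pattern assumes only the "N min" and "N min M s" forms that appear in these tables.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a string such as '4 min 3 s' or '364 min' to whole seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*min(?:\s*(\d+)\s*s)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1))
    # The seconds part is optional; treat a missing value as zero.
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


# Pedigree 2 PG row: with pedigree vs without pedigree (values from the table)
with_pedigree = to_seconds("39 min 21 s")
without_pedigree = to_seconds("26 min 58 s")
print(with_pedigree, without_pedigree, with_pedigree / without_pedigree)
```

A ratio above 1 means the run that used pedigree information took longer for that dataset; the table shows the direction of the difference varies between datasets.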