Pierre Faux, Tom Druet.
Abstract
BACKGROUND: Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of reference individuals are generally sequenced due to prohibitive sequencing costs, thus only a limited amount of familial information is available. However, reference individuals have many relatives that have been genotyped (at lower density). The goal of our study was to improve phasing of WGS data by integrating familial information from haplotypes that were obtained from a larger genotyped dataset and to quantify its impact on imputation accuracy.
Year: 2017 PMID: 28511677 PMCID: PMC5434521 DOI: 10.1186/s12711-017-0321-6
Source DB: PubMed Journal: Genet Sel Evol ISSN: 0999-193X Impact factor: 4.297
Importance of the familial information for the 91 animals of the WGS dataset
| | Phased with 58,369 genotyped animals | Phased with 91 sequenced animals |
|---|---|---|
| Both parents genotyped | 23 | 3 |
| Only one parent genotyped | 67 | 32 |
| At least one offspring genotyped (average number of offspring<sup>a</sup>) | 80 (178.6) | 17 (2.2) |
<sup>a</sup>The average number of genotyped offspring is computed only over animals with at least one genotyped offspring
Size distribution of the WGS segments<sup>a</sup> encompassed by the scaffold (GEN-P2) (number of SNPs, physical length)
| | | GEN-P2 scaffold |
|---|---|---|
| Number of WGS SNPs per segment | Minimum | 1 |
| | Average | 146.97 |
| | Median | 110 |
| | Maximum | 2241 |
| Singleton segments<sup>b</sup> | Number | 145 |
| | Scaffold proportion (%) | 0.41 |
| Physical length of segments in bp | Minimum | 1 |
| | Average | 67,834.76 |
| | Median | 53,507 |
| | Maximum | 1,703,836 |
| Physical length of non-singleton segments in bp<sup>b</sup> | Minimum | 39 |
| | Average | 68,114.66 |
| | Median | 53,715 |
<sup>a</sup>WGS segments are defined as all consecutive WGS SNPs of the trusted set of SNPs for which the closest genotyped SNP is the same
<sup>b</sup>“Singleton segments” refers to segments that contain only one SNP from the scaffold, i.e. a scaffold SNP that encompasses only itself in the WGS data
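The segment definition in the footnote above can be illustrated with a minimal Python sketch. It assumes sorted base-pair positions on a single chromosome, breaks ties towards the left scaffold SNP, and measures physical length inclusively (so that a one-SNP segment has length 1 bp, in line with the minimum reported in the table). The function names, the tie rule and the toy positions are illustrative assumptions, not taken from the study.

```python
from bisect import bisect_left
from itertools import groupby


def closest_scaffold_index(pos, scaffold):
    """Index of the scaffold SNP closest to pos (ties go to the left SNP; assumption)."""
    i = bisect_left(scaffold, pos)
    if i == 0:
        return 0
    if i == len(scaffold):
        return len(scaffold) - 1
    left, right = scaffold[i - 1], scaffold[i]
    return i - 1 if pos - left <= right - pos else i


def wgs_segments(wgs_positions, scaffold_positions):
    """Group consecutive WGS SNPs that share the same closest scaffold SNP."""
    keyed = [(closest_scaffold_index(p, scaffold_positions), p) for p in wgs_positions]
    return [[p for _, p in grp] for _, grp in groupby(keyed, key=lambda kp: kp[0])]


# Toy positions in bp (hypothetical, for illustration only).
scaffold = [100, 1000, 5000]
wgs = [50, 100, 400, 700, 1000, 2500, 4000, 5000]

segments = wgs_segments(wgs, scaffold)
# Inclusive physical length of each segment, in bp.
lengths = [seg[-1] - seg[0] + 1 for seg in segments]
```

With these toy inputs the eight WGS SNPs fall into three segments, one per scaffold SNP; summary statistics such as the minimum, average, median and maximum of `lengths` would then correspond to the rows of the table.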
Fig. 1 Flowchart of all phasing and imputation steps. Synoptic view of the two phasing strategies (P1: LD information only; P2: both LD and familial information) applied to the two datasets (GEN: 50k dense genotype array data; WGS: whole-genome sequence data) and the two imputation scenarios
Statistics of phasing results for the two phasing strategies
| | Trusted set of variants | | | | Traditional SNP filtering | | | |
|---|---|---|---|---|---|---|---|---|
| | WGS-P1 | | WGS-P2 | | WGS-P1 | | WGS-P2 | |
| | Average | Median | Average | Median | Average | Median | Average | Median |
| Proportion of errors | ||||||||
| Per animal | 50.95% | 50.13% | 0.38% | 0.32% | 50.80% | 50.41% | 1.10% | 1.04% |
| Number of switches | ||||||||
| Per animal | 739.2 | 631.5 | 704.5 | 574 | 4521 | 4291 | 4387.7 | 4079 |
| Per animal and chromosome | 25.49 | 18.5 | 24.29 | 16.5 | 155.9 | 105.5 | 151.3 | 112 |
WGS-P1: phased with LD information only; WGS-P2: phased with both LD and familial information
Lengths of segments without switches of the trusted set of WGS SNPs for the two phasing strategies<sup>a</sup>, whether correctly or wrongly phased or both (all)
| | WGS-P1 | | | WGS-P2 | | |
|---|---|---|---|---|---|---|
| | All | Correct | Wrong | All | Correct | Wrong |
| Original segments | ||||||
| Physical length<sup>b</sup> | ||||||
| Avg | 3.01 | 2.96 | 3.07 | 3.19 | 6.11 | 84.99 kb |
| Med | 4.58 kb | 4.75 kb | 4.33 kb | 3.38 kb | 1.74 | 1 bp |
| Max | 150.73 | 150.73 | 123.75 | 116.39 | 116.39 | 42.76 |
| Proportion of singletons<sup>c</sup> | 37.52% | 37.53% | 37.51% | 38.82% | 12.18% | 67.24% |
| Number of phasable SNPs per segment | ||||||
| Avg | 1048.45 | 1041.75 | 1055.17 | 1098.05 | 2119.04 | 8.39 |
| Med | 4 | 4 | 4 | 3 | 644 | 1 |
| Max | 55,553 | 55,553 | 50,217 | 51,340 | 51,340 | 835 |
| Number of SNPs per segment | ||||||
| Avg | 6232.95 | 6134.89 | 6331.09 | 6612.58 | 12,639.78 | 179.99 |
| Med | 12 | 13 | 12 | 8.5 | 3641 | 1 |
| Max | 308,884 | 308,884 | 240,332 | 231,051 | 231,051 | 85,165 |
| After discarding singletons<sup>c</sup> | ||||||
| Physical length<sup>b</sup> | ||||||
| Avg | 9.48 | 9.28 | 9.67 | 11.02 | 19.46 | 0.36 |
| Med | 1.25 | 1.39 | 1.12 | 0.26 | 6.39 | 0.01 |
| Max | 154.47 | 154.47 | 147.3 | 147.3 | 147.3 | 42.76 |
| Number of phasable SNPs per segment | ||||||
| Avg | 3185.72 | 3163.65 | 3207.84 | 3652.87 | 6524.06 | 28.54 |
| Med | 367 | 418 | 314 | 59 | 2352 | 4 |
| Max | 67,776 | 67,776 | 60,281 | 67,776 | 67,776 | 1452 |
| Number of SNPs per segment | ||||||
| Avg | 19,586.14 | 19,210.75 | 19,962.22 | 22,778.53 | 40,230.9 | 748.21 |
| Med | 2555 | 2925.5 | 2306 | 523 | 13,713 | 15 |
| Max | 318,618 | 318,618 | 304,742 | 304,742 | 304,742 | 85,165 |
| After discarding segments with fewer than five phasable SNPs and shorter than 5 kb | ||||||
| Physical length<sup>b</sup> | ||||||
| Avg | 14.78 | 14.46 | 15.11 | 19.36 | 31.64 | 0.72 |
| Med | 4.62 | 4.72 | 4.51 | 2.28 | 19.59 | 0.06 |
| Max | 158.12 | 158.1 | 158.12 | 158.24 | 158.24 | 42.76 |
| Number of phasable SNPs per segment | ||||||
| Avg | 4956.66 | 4911.73 | 5001.86 | 6400.99 | 10,581.41 | 53.68 |
| Med | 1489 | 1584 | 1401 | 522 | 6580 | 13 |
| Max | 73,348 | 73,348 | 60,625 | 67,776 | 67,776 | 1452 |
| Number of SNPs per segment | ||||||
| Avg | 30,549.61 | 29,916.85 | 31,186.28 | 40,033.57 | 65,412.51 | 1499.57 |
| Med | 9884 | 10,364 | 9497.5 | 4913 | 41,646 | 138 |
| Max | 327,738 | 327,706 | 327,738 | 327,914 | 327,914 | 85,165 |
A segment is defined as a run of consecutive phasable SNPs without switches
<sup>a</sup>WGS-P1: phased with LD information only; WGS-P2: phased with both LD and familial information
<sup>b</sup>Unless specified, all length units are in Mb
<sup>c</sup>“Singleton” refers to segments that contain only one SNP
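The footnote definition of a segment (a run of consecutive phasable SNPs without switches) can be sketched in Python as follows. The sketch assumes that each phasable SNP carries a boolean flag stating whether its phase complies with Mendelian segregation, and that the proportion of errors is the share of flagged-wrong SNPs. Both the input encoding and the function names are assumptions made for illustration, not the authors' implementation.

```python
from itertools import groupby


def phase_segments(correct_flags):
    """Split per-SNP flags into runs without a switch: (is_correct, n_snps)."""
    return [(flag, sum(1 for _ in run)) for flag, run in groupby(correct_flags)]


def summarize(correct_flags):
    """Return segments, number of switches, error proportion, singleton share.

    Assumes a non-empty list of booleans (True = phase agrees with the reference).
    """
    segs = phase_segments(correct_flags)
    n_switches = len(segs) - 1  # one switch between each pair of adjacent runs
    error_rate = correct_flags.count(False) / len(correct_flags)
    singleton_share = sum(1 for _, n in segs if n == 1) / len(segs)
    return segs, n_switches, error_rate, singleton_share


# Toy flags for ten phasable SNPs (hypothetical).
flags = [True, True, True, False, True, True, False, False, True, True]
segs, n_switches, error_rate, singleton_share = summarize(flags)
```

Here the ten SNPs split into five segments separated by four switches; three "correct" and two "wrong" segments would feed the Correct and Wrong columns, and all five the All column.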
Distance<sup>a</sup> between any SNP of the trusted set of WGS SNPs and the closest switch
| WGS-P1 | WGS-P2 | |
|---|---|---|
| Average | 6.74 | 7.77 |
| Median | 3.49 | 4.08 |
| Maximum | 150.81 | 98.69 |
Distances are estimated on 30 animals of the training population
<sup>a</sup>In Mb
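One way such distances could be computed is sketched below in Python. It assumes that each switch is assigned a single base-pair coordinate on the chromosome, that the switch list is sorted and non-empty, and that distances are converted from bp to Mb at the end. The coordinate convention and all names are assumptions for illustration only.

```python
from bisect import bisect_left
from statistics import mean, median


def distance_to_closest_switch(snp_positions, switch_positions):
    """Distance in bp from each SNP to its nearest switch.

    switch_positions must be sorted and contain at least one switch.
    """
    distances = []
    for p in snp_positions:
        i = bisect_left(switch_positions, p)
        # Only the switches immediately left and right of p can be the closest.
        candidates = switch_positions[max(i - 1, 0): i + 1]
        distances.append(min(abs(p - s) for s in candidates))
    return distances


# Toy coordinates in bp (hypothetical).
snps = [1_000_000, 3_000_000, 6_000_000, 9_000_000]
switches = [2_000_000, 8_000_000]

d_mb = [d / 1e6 for d in distance_to_closest_switch(snps, switches)]
summary = (mean(d_mb), median(d_mb), max(d_mb))
```

The three values in `summary` correspond to the Average, Median and Maximum rows of the table, here for a single toy chromosome rather than the 30 animals used in the study.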
Fig. 2 Proportion of the genome by class of size of phased segments. Proportions of the genome in segments that are longer or equal to 5, 10, 20 or 50 Mb, regardless of whether they are correctly (grey) or incorrectly (black) phased, when phasing the WGS data using only LD information (WGS-P1) or both LD and familial information (WGS-P2)
Lengths of segments without switches obtained with WGS SNPs selected with more traditional filtering rules for the two phasing strategies<sup>a</sup>, whether correctly or wrongly phased or both (all)
| | WGS-P1 | | | WGS-P2 | | | | |
|---|---|---|---|---|---|---|---|---|
| | All | Correct | Wrong | All | Correct | Wrong | | |
| Original segments | ||||||||
| Physical length<sup>b</sup> | ||||||||
| Avg | 0.50 | 0.49 | 0.50 | 0.51 | 0.99 | 0.04 | ||
| Med | 346 bp | 297 bp | 404 bp | 396 bp | 223.51 kb | 1 bp | ||
| Max | 34.88 | 34.88 | 21.33 | 25.64 | 25.64 | 14.65 | ||
| Proportion of singletons<sup>c</sup> | 40.76% | 41.05% | 40.48% | 40.61% | 16.11% | 65.31% | ||
| Number of phasable SNPs per segment | ||||||||
| Avg | 373.70 | 370.72 | 376.69 | 384.98 | 758.49 | 8.40 | ||
| Med | 2 | 2 | 2 | 2 | 131 | 1 | ||
| Max | 29,712 | 29,712 | 26,646 | 32,839 | 32,839 | 1813 | ||
| Number of SNPs per segment | ||||||||
| Avg | 2618.11 | 2593.29 | 2642.94 | 2704.68 | 5179.74 | 209.22 | ||
| Med | 4 | 4 | 4 | 4 | 1228.5 | 1 | ||
| Max | 158,630 | 158,630 | 113,469 | 134,130 | 134,130 | 84,929 | ||
| After discarding singletons<sup>c</sup> | ||||||||
| Physical length<sup>b</sup> | ||||||||
| Avg | 1.88 | 1.87 | 1.9 | 1.94 | 3.64 | 0.19 | ||
| Med | 0.05 | 0.05 | 0.05 | 0.03 | 1.04 | 1.92 kb | ||
| Max | 63.16 | 55.98 | 63.16 | 85.72 | 85.72 | 31.87 | ||
| Number of phasable SNPs per segment | ||||||||
| Avg | 1335.70 | 1325.27 | 1346.13 | 1373.47 | 2675.79 | 28.52 | ||
| Med | 14 | 14 | 14 | 10 | 659 | 3 | ||
| Max | 74,067 | 57,000 | 74,067 | 97,334 | 97,334 | 2787 | ||
| Number of SNPs per segment | ||||||||
| Avg | 9895.56 | 9823.59 | 9967.51 | 10,208.87 | 19,121.57 | 1004.46 | ||
| Med | 255 | 241 | 266 | 153 | 5657 | 13 | ||
| Max | 325,812 | 293,893 | 325,812 | 437,121 | 437,121 | 141,991 | ||
| After discarding segments with fewer than five phasable SNPs and shorter than 5 kb | ||||||||
| Physical length<sup>b</sup> | ||||||||
| Avg | 4.34 | 4.31 | 4.36 | 4.84 | 8.84 | 0.50 | ||
| Med | 0.95 | 0.95 | 0.95 | 0.44 | 3.52 | 494.45 kb | ||
| Max | 121.41 | 121.41 | 116.11 | 121.41 | 121.41 | 33.07 | ||
| Number of phasable SNPs per segment | ||||||||
| Avg | 3045.22 | 3024.33 | 3066.06 | 3388.28 | 6444.96 | 69.46 | ||
| Med | 318 | 310.5 | 325 | 78 | 2311 | 10 | ||
| Max | 110,344 | 110,344 | 86,208 | 112,056 | 112,056 | 2793 | ||
| Number of SNPs per segment | ||||||||
| Avg | 22,773.56 | 22,633.66 | 22,913.14 | 25,438.65 | 46,480.17 | 2592.66 | ||
| Med | 4951 | 4939 | 4971 | 2206 | 19,046 | 254 | ||
| Max | 605,700 | 605,700 | 601,045 | 653,272 | 653,272 | 160,715 | ||
A segment is defined as a run of consecutive phasable SNPs without switches
<sup>a</sup>WGS-P1: phased with LD information only; WGS-P2: phased with both LD and familial information
<sup>b</sup>Unless specified, all length units are in Mb
<sup>c</sup>“Singleton” refers to segments that contain only one SNP
Imputation reliability (measured as r<sup>2</sup> and given in %) for the two imputation scenarios<sup>a</sup>
| | Trusted set of variants | | | | | | Traditional variant filtering | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | N | WGS-I1<sup>a</sup> | | WGS-I2<sup>a</sup> | | DI2-I1<sup>b</sup> | N | WGS-I1<sup>a</sup> | | WGS-I2<sup>a</sup> | | DI2-I1<sup>b</sup> |
| | | Avg | Med | Avg | Med | | | Avg | Med | Avg | Med | |
| Overall | 5,149,267 | 90.47 | 93.63 | 90.65 | 93.81 | 0.18 | 13,129,937 | 88.66 | 93.57 | 89.07 | 93.87 | 0.41 |
| NMA = 2<sup>c</sup> | 79,755 | 56.74 | 67.47 | 59.98 | 86.76 | 3.24 | 680,303 | 63.04 | 80.94 | 67.95 | 96.96 | 4.91 |
| NMA = 3<sup>c</sup> | 78,933 | 69.10 | 71.02 | 70.78 | 75.35 | 1.68 | 510,080 | 77.28 | 92.25 | 79.00 | 95.49 | 1.72 |
| 0.01 < MAF ≤ 0.05 | 644,224 | 77.57 | 85.82 | 78.45 | 87.11 | 0.88 | 3,278,384 | 79.56 | 91.35 | 81.01 | 94.27 | 1.46 |
| 0.05 < MAF ≤ 0.10 | 673,955 | 89.15 | 91.65 | 89.21 | 91.81 | 0.06 | 2,047,206 | 89.91 | 92.82 | 90.07 | 93.15 | 0.16 |
| 0.10 < MAF | 3,831,088 | 92.87 | 94.20 | 92.95 | 94.30 | 0.08 | 7,804,347 | 92.15 | 93.91 | 92.19 | 93.96 | 0.04 |
| First Mb | 48,089 | 85.40 | 90.90 | 87.56 | 92.84 | 2.15 | 134,266 | 83.17 | 90.10 | 85.19 | 92.27 | 2.01 |
| Last Mb | 53,502 | 87.94 | 91.55 | 88.45 | 92.05 | 0.51 | 155,246 | 85.85 | 91.26 | 86.53 | 91.88 | 0.68 |
| Between first and last Mb | 5,045,959 | 90.55 | 93.68 | 90.70 | 93.84 | 0.16 | 12,840,425 | 88.75 | 93.62 | 89.14 | 93.91 | 0.39 |
<sup>a</sup>WGS-I1: imputation from GEN-P1 to WGS-P1 (using only LD information); WGS-I2: imputation from GEN-P2 to WGS-P2 (using both LD and familial information)
<sup>b</sup>DI2-I1: difference of average r<sup>2</sup>
<sup>c</sup>NMA: number of occurrences of the minor allele
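As a rough illustration of the reliability measure and of the DI2-I1 column, the Python sketch below computes a per-SNP r<sup>2</sup> in % and the difference between two averages. It assumes that r<sup>2</sup> is the squared Pearson correlation between true genotypes and imputed values across validation animals; the source only states that reliability is measured as r<sup>2</sup>, so this definition, the function name and the toy genotypes are assumptions.

```python
from statistics import mean


def r2_percent(true_geno, imputed):
    """Squared Pearson correlation between true and imputed values, in %.

    Returns NaN when either vector has no variance (r2 is undefined).
    """
    mt, mi = mean(true_geno), mean(imputed)
    cov = sum((t - mt) * (i - mi) for t, i in zip(true_geno, imputed))
    var_t = sum((t - mt) ** 2 for t in true_geno)
    var_i = sum((i - mi) ** 2 for i in imputed)
    if var_t == 0 or var_i == 0:
        return float("nan")
    return 100.0 * cov * cov / (var_t * var_i)


# Toy example: genotypes coded 0/1/2 for four animals (hypothetical).
perfect = r2_percent([0, 1, 2, 1], [0, 1, 2, 1])

# DI2-I1 for the "Overall" row of the trusted set, from the averages in the table.
di2_i1 = round(90.65 - 90.47, 2)
```

The last line reproduces the 0.18 reported for the overall trusted set; the other DI2-I1 entries follow from the two Avg columns in the same way, up to rounding of the published averages.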
Fig. 3 WGS-P1 and WGS-P2 phases of bovine autosome 2 for two animals of the training set. Consecutive SNPs with phase in compliance with Mendelian segregation rules delimit correct segments (in grey); conversely, consecutive markers with phase not in compliance with Mendelian segregation rules delimit incorrect segments (in black). Corresponding number of switches and proportion of errors are indicated on the right side of each phase