Dorcus Kholofelo Malomane, Christian Reimer, Steffen Weigend, Annett Weigend, Ahmad Reza Sharifi, Henner Simianer.
Abstract
BACKGROUND: Single nucleotide polymorphism (SNP) panels have been widely used to study genomic variation within and between populations. Methods of SNP discovery have been a matter of debate because of their potential to introduce ascertainment bias, and genetic diversity results obtained from SNP genotype data can be misleading. We used a total of 42 chicken populations for which both individually genotyped array data and pooled whole genome resequencing (WGS) data were available. We compared allele frequency distributions and genetic diversity measures (expected heterozygosity (H_e), fixation index (F_ST) values, genetic distances and principal components analysis (PCA)) between the two data types. With the array data, we applied different filtering options (SNPs polymorphic in samples of two Gallus gallus wild populations, linkage disequilibrium (LD) based pruning and minor allele frequency (MAF) filtering, and combinations thereof) to assess their potential to mitigate the ascertainment bias.
Keywords: Ascertainment bias; LD based pruning; SNP filtering; SNP panels
Year: 2018 PMID: 29304727 PMCID: PMC5756397 DOI: 10.1186/s12864-017-4416-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
List of breeds, their abbreviations and sample sizes as used in the study
| Breed and abbreviation | Array data (n) | WGS data (n) |
|---|---|---|
| Commercial breeds: | ||
| WL_A – White Leghorn line A | 20a | 10a |
| BL_A – Rhode Island Red line A | 20a | 15a |
| BL_D – White Rock line D | 20a | 15a |
| Wild populations: | ||
| GGg – | 10 (10) | 10 |
| GGsc – | 9 (10) | 9 |
| European populations: | ||
| ABwa – Barbue d’Anvers quail | 10 (10) | 10 |
| ARsch – Rumpless Araucana black | 9 (11) | 9 |
| BAsch – Rosecomb Bantam black | 10 (10) | 10 |
| BKschg – Bergische Crower | 10 (22) | 10 |
| DZgh – German Bantam gold partridge | 10 (10) | 10 |
| FZgpo – Booted Bantam millefleur | 10 (10) | 10 |
| HOxx – Dutch White Crested | 10 (7) | 10 |
| ITrh – Leghorn brown | 10 (10) | 10 |
| KAsch – Castilians black | 9 (11) | 9 |
| KRsch – Creeper black | 10 (20) | 10 |
| KRw – Creeper white | 10 (20) | 10 |
| LER11 – White Leghorn line R11 | 9 (13) | 9 (1) |
| OMsschg – East Friesian Gulls silver penciled | 10 (10) | 10 |
| PAxx – Poland any colour | 11 (12) | 11 |
| SBsschs – Sebright Bantam silver | 10 (10) | 10 |
| WTs – Westphalian Chicken silver | 10 (10) | 10 |
| Asian populations: | ||
| ASrb – Aseel red mottled | 10 (10) | 10 |
| BHrg – Brahma gold | 10 (10) | 10 |
| CHgesch – Japanese Bantam black tailed buff | 10 (12) | 10 |
| CHschw – Japanese Bantam black mottled | 10 (19) | 10 |
| COsch – Cochin black | 10 (11) | 10 |
| DLIa – German Faverolles salmon | 10 (10) | 10 |
| KSgw – Ko Shamo black-red | 9 (13) | 9 |
| MAxx – Malay black red | 10 (21) | 10 |
| MRschk – Marans copper black | 10 (10) | 10 |
| NHL68 – New Hampshire line 68 | 9 (14) | 9 (1) |
| OFrbx – Orloff red spangled | 10 (15) | 10 |
| OHsh – Ohiki silver duckwing | 10 (10) | 10 |
| ORge – Orpington buff | 10 (10) | 10 |
| SAsch – Sumatra black | 9 (11) | 9 |
| SEw – Silkies white | 10 (10) | 10 |
| SHsch – Shamo black | 9 (11) | 9 |
| SNwsch – Sundheimer light | 10 (10) | 10 |
| TOgh – Toutenkou black breasted red | 10 (11) | 10 |
| WYw – Wyandotte white | 10 (9) | 10 |
| YOwr – Yokohama red saddled white | 10 (10) | 10 |
| ZCw – Pekin Bantam white | 10 (10) | 10 |
n is the number of individuals; numbers in brackets () are additional individuals added to the population (not present in the other data type)
a Completely different individuals in the two data sets
Array data set versions with different filtering strategies applied
| Given name for data set | Filter(s) applied | No. of SNPs |
|---|---|---|
| Array_all | | 401,125 |
| Array_MAF5 | Filtered out SNPs with MAF less than 5% | 379,342 |
| GG | Retained only SNPs that are polymorphic in the two Gallus gallus wild populations | 289,390 |
| GG_MAF5 | GG and filtered out SNPs with MAF less than 5% | 284,748 |
| Pruned | SNPs were pruned based on LD | 122,006 |
| Pruned_MAF5 | Pruned and filtered out SNPs with MAF less than 5% | 107,604 |
| Pruned_GG | Pruned and GG | 86,404 |
| Pruned_GG_MAF5 | Pruned_GG and filtered out SNPs with MAF less than 5% | 82,975 |
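The MAF-based filters listed in the table above can be illustrated with a minimal Python sketch. It assumes biallelic SNPs summarised by one allele frequency per SNP; the function names, the dictionary layout and the example frequencies are hypothetical, and how the study actually computed MAF (for example, across which populations) is not stated in this record.

```python
def minor_allele_frequency(freq):
    """Return the minor allele frequency of a biallelic SNP,
    given the frequency of either of its two alleles."""
    return min(freq, 1.0 - freq)


def filter_maf(snp_freqs, threshold=0.05):
    """Keep only SNPs whose MAF is at least `threshold`.

    `snp_freqs` maps a SNP identifier to an allele frequency in [0, 1].
    The 5% default mirrors the "MAF5" data sets; everything else here
    is an illustrative assumption.
    """
    return {
        snp: freq
        for snp, freq in snp_freqs.items()
        if minor_allele_frequency(freq) >= threshold
    }


# Made-up frequencies, for illustration only.
example_freqs = {"snp1": 0.02, "snp2": 0.50, "snp3": 0.97, "snp4": 0.90}
kept = filter_maf(example_freqs)
print(sorted(kept))
```

In this toy input, `snp1` and `snp3` have a MAF below 5% and are dropped, while `snp2` and `snp4` are retained.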
Fig. 1. Allele frequency spectrum of array data and corresponding WGS loci for 39 populations
Fig. 2. Allele frequency spectrum of array data (left) and WGS data (right) for 39 populations
Fig. 3. Comparisons of expected heterozygosity (H_e) estimates between WGS (boxplot of 100 replicates) and array (Array_all, GG and Pruned) data
Relationship between the H_e estimates of WGS and the array data sets
| | r_s | Slope |
|---|---|---|
| Array_all | 0.956 | 2.233 |
| Array_MAF5 | 0.957 | 2.321 |
| GG | | 2.770 |
| GG_MAF5 | 0.984 | 2.790 |
| Pruned | 0.973 | |
| Pruned_MAF5 | 0.974 | 2.340 |
| Pruned_GG | 0.983 | 2.675 |
| Pruned_GG_MAF5 | 0.983 | 2.717 |
r_s – Spearman’s rank correlation. Slope – the slope of the regression line when the H_e estimates of array data are regressed against those of WGS data
a Numbers in bold face represent the best value in the column. These results are based on 39 populations
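The two statistics in the table above can be sketched in plain Python. This is a minimal illustration, assuming biallelic loci with H_e taken as the mean of 2p(1 − p) across loci; the helper names and the example values are hypothetical and do not reproduce the study's data or software.

```python
from statistics import mean


def expected_heterozygosity(allele_freqs):
    """Mean expected heterozygosity over biallelic loci: mean of 2p(1 - p)."""
    return mean(2.0 * p * (1.0 - p) for p in allele_freqs)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def ols_slope(x, y):
    """Least-squares slope when y is regressed on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )


# Made-up per-population H_e values, for illustration only.
wgs_he = [0.10, 0.15, 0.20, 0.25]
array_he = [0.25, 0.35, 0.45, 0.55]
print(spearman(wgs_he, array_he), ols_slope(wgs_he, array_he))
```

A slope above 1 with a high rank correlation, as in this toy input, corresponds to array estimates that preserve the ordering of populations while spreading their H_e values over a wider range than the WGS estimates.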
Fig. 4. Regressions through the pairwise F_ST values between WGS and array data. Black lines represent the expected identity relationship between the two data sets (with a slope of 1)
The relationship between the F_ST estimates of the WGS and array data
| Array data set vs. WGS | Slope | R2 | Regression constant | Standard error (SE) | Residual variance |
|---|---|---|---|---|---|
| Array_all | 1.179 | 0.954 | −0.028 | 0.009 | 0.0001 |
| Array_MAF5 | 1.183 | 0.954 | −0.027 | 0.010 | 0.0001 |
| GG | 1.197 | | −0.028 | 0.009 | 0.0001 |
| GG_MAF5 | 1.197 | | −0.028 | 0.009 | 0.0001 |
| Pruned | | 0.937 | −0.017 | 0.010 | 0.0001 |
| Pruned_MAF5 | 1.033 | 0.939 | −0.016 | 0.010 | 0.0001 |
| Pruned_GG | 1.055 | 0.940 | −0.018 | 0.010 | 0.0001 |
| Pruned_GG_MAF5 | 1.057 | 0.941 | −0.017 | 0.010 | 0.0001 |
a Numbers in bold face represent the best value in the column. R2 – coefficient of determination of the regression. These results are based on 39 populations
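As a rough illustration of what a pairwise F_ST value measures, the sketch below computes a simple heterozygosity-based F_ST, (H_T − H_S) / H_T, from the allele frequencies of two populations at biallelic loci. This particular estimator is an assumption chosen for brevity; the record does not say which F_ST estimator the study used, and the function names and inputs are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at one biallelic locus."""
    return 2.0 * p * (1.0 - p)


def pairwise_fst(freqs_a, freqs_b):
    """Heterozygosity-based F_ST between two populations.

    `freqs_a` and `freqs_b` hold the frequency of the same allele at the
    same loci in populations A and B. H_S is the within-population
    heterozygosity averaged over the two populations, H_T the
    heterozygosity at the mean allele frequency; both are summed over
    loci before taking the ratio. Equal population weights are assumed.
    """
    h_within = 0.0
    h_total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        h_within += (heterozygosity(pa) + heterozygosity(pb)) / 2.0
        h_total += heterozygosity((pa + pb) / 2.0)
    if h_total == 0.0:
        # No variation at any locus: differentiation is undefined, report 0.
        return 0.0
    return (h_total - h_within) / h_total


# Made-up frequencies, for illustration only.
print(pairwise_fst([0.2, 0.7, 0.5], [0.4, 0.6, 0.9]))
```

Identical allele frequencies give a value of 0, and populations fixed for different alleles at every locus give a value of 1.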
Frobenius (F) distances between distance matrices of WGS and array data
| | Array_all | Array_MAF5 | GG | GG_MAF5 | Pruned | Pruned_MAF5 | Pruned_GG | Pruned_GG_MAF5 |
|---|---|---|---|---|---|---|---|---|
| Array_all | 5.312 ± 0.001 | |||||||
| Array_MAF5 | 0.591 | 5.889 ± 0.001 | ||||||
| GG | 1.239 | 0.685 | 6.501 ± 0.001 | |||||
| GG_MAF5 | 1.434 | 0.868 | 0.200 | 6.700 ± 0.001 | ||||
| Pruned | 2.230 | 2.810 | 3.397 | 3.596 | ||||
| Pruned_MAF5 | 1.332 | 1.886 | 2.447 | 2.644 | 0.971 | 4.115 ± 0.001 | ||
| Pruned_GG | 1.034 | 1.530 | 2.038 | 2.232 | 1.417 | 0.462 | 4.548 ± 0.002 | |
| Pruned_GG_MAF5 | 0.811 | 1.216 | 1.676 | 1.867 | 1.800 | 0.836 | 0.329 | 4.931 ± 0.002 |
The diagonal is a mean of the F distance between the array data set and 100 WGS replicates with the standard errors (SE)
a Numbers in bold face represent the best value in the column
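The Frobenius distance between two distance matrices is the square root of the sum of squared element-wise differences. The sketch below shows that definition in plain Python, assuming both matrices are square, have the same shape and list the populations in the same order; the function name and the example matrices are hypothetical, and any scaling or normalisation the study may have applied beforehand is not covered.

```python
import math


def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference between matrices a and b.

    Both arguments are lists of rows of equal shape.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return math.sqrt(
        sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    )


# Made-up 2 x 2 genetic distance matrices, for illustration only.
wgs_dist = [[0.0, 1.0], [1.0, 0.0]]
array_dist = [[0.0, 2.0], [2.0, 0.0]]
print(frobenius_distance(wgs_dist, array_dist))
```

Smaller values mean the two matrices describe more similar sets of pairwise distances; identical matrices give 0.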
Fig. 5. Neighbour joining trees of WGS, Array_all, GG and Pruned data sets
Fig. 6. Topological distances between the NJ trees of array and 100 replicates of WGS data based on the Billera method. The boxplots reflect distances between the 100 replicates of WGS and the blue dots are the mean distance between the array set and the 100 WGS replicates
Fig. 7. Topological distances between NJ trees of array and 100 replicates of WGS data based on the Penny and Hendy method. The boxplots reflect distances between the 100 replicates of WGS and the blue dots are the mean distance between the array set and the 100 WGS replicates
Fig. 8. Two dimensional PCA plots of WGS and array (Array_all, GG and Pruned) data