In most studies aimed at localizing footprints of past selection, outliers at tails of the empirical distribution of a given test statistic are assumed to reflect locus-specific selective forces. Significance cutoffs are subjectively determined, rather than being related to a clear set of hypotheses. Here, we define an empirical p-value for the summary statistic by means of a permutation method that uses the observed SNP structure in the real data. To illustrate the methodology, we applied our approach to a panel of 2.9 million autosomal SNPs identified from re-sequencing a pool of 15 individuals from a brown egg layer line. We scanned the genome for local reductions in heterozygosity, suggestive of selective sweeps. We also employed a modified sliding window approach that accounts for gaps in the sequence and increases scanning resolution by moving the overlapping windows by steps of one SNP only, and suggest to call this a "creeping window" strategy. The approach confirmed selective sweeps in the region of previously described candidate genes, i.e. TSHR, PRL, PRLHR, INSR, LEPR, IGF1, and NRAMP1 when used as positive controls. The genome scan revealed 82 distinct regions with strong evidence of selection (genome-wide p-value<0.001), including genes known to be associated with eggshell structure and immune system such as CALB1 and GAL cluster, respectively. A substantial proportion of signals was found in poor gene content regions including the most extreme signal on chromosome 1. The observation of multiple signals in a highly selected layer line of chicken is consistent with the hypothesis that egg production is a complex trait controlled by many genes.
In most studies aimed at localizing footprints of past selection, outliers at tails of the empirical distribution of a given test statistic are assumed to reflect locus-specific selective forces. Significance cutoffs are subjectively determined, rather than being related to a clear set of hypotheses. Here, we define an empirical p-value for the summary statistic by means of a permutation method that uses the observed SNP structure in the real data. To illustrate the methodology, we applied our approach to a panel of 2.9 million autosomal SNPs identified from re-sequencing a pool of 15 individuals from a brown egg layer line. We scanned the genome for local reductions in heterozygosity, suggestive of selective sweeps. We also employed a modified sliding window approach that accounts for gaps in the sequence and increases scanning resolution by moving the overlapping windows by steps of one SNP only, and suggest to call this a "creeping window" strategy. The approach confirmed selective sweeps in the region of previously described candidate genes, i.e. TSHR, PRL, PRLHR, INSR, LEPR, IGF1, and NRAMP1 when used as positive controls. The genome scan revealed 82 distinct regions with strong evidence of selection (genome-wide p-value<0.001), including genes known to be associated with eggshell structure and immune system such as CALB1 and GAL cluster, respectively. A substantial proportion of signals was found in poor gene content regions including the most extreme signal on chromosome 1. The observation of multiple signals in a highly selected layer line of chicken is consistent with the hypothesis that egg production is a complex trait controlled by many genes.
‘Selection signatures’ are defined as regions of the genome that harbour functionally important sequence variants and therefore are or have been under either natural or artificial selection. The physical extent of such signatures, up- and downstream of the functional variant, is a consequence of the so-called hitchhiking effect. As stated by Maynard Smith and Haigh [1], three patterns are generated locally around the position of a favorable mutation. First, the density of segregating sites decreases in adjacent regions so that the level of variability will be reduced [2], [3]. Second, the site frequency spectrum (SFS), which describes the frequency of allelic variants, shifts from its neutral expectation towards a relative excess of extreme (rare or high) frequencies [4], [5]. Third, a specific linkage disequilibrium (LD) pattern emerges around the target of positive selection relative to what is expected under neutrality [6], [7].The search for molecular signatures of positive selection has been a matter of intense research in recent years, motivated by the hope to associate genes that experienced recent strong selection with functions and phenotypes (for review see [8], [9]. These studies have resulted in the development of various statistics aimed to detect selection at the DNA level in population samples. The methods used are based either on the site frequency characteristics (focusing on single loci) or on properties of haplotypes segregating in populations.In site frequency based methods the level of DNA polymorphism is assessed for a very large number of loci on a genome-wide scale within populations. Conceptually, the goal is to identify genomic regions with a reduced variation or a different shape of the SFS than the norm of the genome. These methods essentially assume that demographic effects and population structure affect the whole genome in the same fashion; on the other hand positive selection should influence only individual genes and, through the hitchhiking affect, the surrounding regions. This concept has been used on a genome-wide scale to detect signals of past selection in humans and other species [10]–[15]. Genomic scans for local variability have also been conducted in chicken [16], [17]. The last authors introduced the so-called “Pooled Heterozygosity” (H) statistic, a variability estimator based on allele counts across sliding windows of adjacent loci to look for areas that deviate from the norm.It is important to note that many of these studies have focused on the observed distribution of a given test statistic, assuming that loci in the tails of this distribution have been targets of recent selection [18]. Although this approach to detect selective sweeps in genome-wide data sets seems appealing, questions about the statistical validity of this strategy have been raised [19], [20], [9]. Since, as highlighted by Williamson et al. [21], the prevalence of selection in the genome is unknown, the “empirical p-value” strategy does not directly test the hypothesis of selection at any putative locus and provides no means for quantifying how common selection is across the genome. For instance, the null hypothesis of selective neutrality could be true for the entire genome, in which case even the most extreme values would carry no information regarding selection.Kim and Stephan [22] proposed the composite likelihood ratio (CLR) test to localize selective sweeps in subgenomic regions based on the change in the shape of the allele frequency spectrum. They used coalescent simulations to derive a distribution of the test statistic under the null hypothesis of no selection. However, the use of simulation requires accurately mimicking population demography as well as making assumptions that may or may not hold (e.g., uniform recombination or mutation rate across the genome, etc). In a similar study Nielsen et al. [23] extended the CLR test to derive the expected background pattern of variability from the data itself, rather than from a population genetic model. This approach compares a neutral null model for the evolution of a genomic window with a selective sweep model and can be applied to species having sufficient genome wide SNP data available [7], [21], [24]. It appears that CLR is one of the few metrics that robustly tests the statistical significance of a putative region for the hypothesis of positive selection.In this study, we compared genome-wide H estimates based on 2.9 million SNPs from a commercial line of egg laying chickens (see methods). We employed a modified sliding window approach (referred to as a “creeping window”, CW) and validated the method by confirming the identification of previously described candidate genes. Furthermore, we used a permutation method that uses the original allele frequency spectrum of the genome under study to define the significance thresholds for the H values. In total 132 genes or genomic regions that display patterns of genetic variation consistent with the hypothesis of positive selection are presented, comprising some striking examples of selective sweeps that span over several megabases.
Results and Discussion
Creeping windows
Scanning a genome by sliding a non-(or partly) overlapping window of uniform length along the sequence is a common strategy in site frequency based methods. The primary objective of such “sliding windows” (SW) strategies is to reduce the noisiness of single-locus statistics by combining data from several adjacent markers. The window size is often subjectively determined which can influence the final results and interpretations. Regarding the fact that continuous stretches of H values (or any site frequency based metric) are correlated to an extent determined by the level of linkage disequilibrium, one may suggest adjusting window sizes such that the extent of LD is reflected [25]. However, it remains unclear how to account for the age of selective sweeps in view of their diverse length, as well as for varying levels of local LD across the genome or between populations. The CW method we used (see Methods) has the advantage of simplicity and is applicable with all site frequency based statistics. In addition, the algorithm accounts for the non-uniform distribution of markers, so that artifact signals originating from conflicting effects of genomic gaps are avoided. Figure 1 illustrates lack of uniformity in the distributions of inter-marker distance and gap size.
Figure 1
Histograms of (A) distance between neighboring markers and (B) gap size in the final data set.
To evaluate the performance of the CW approach the distribution of pooled heterozygosity profiles was compared with different implementations of the SW approach. Applying the CW strategy with H values genome-wide resulted in 862’400 windows. The mean number of SNPs in a window was 199±78 with window size varying between 30 and 40 Kb, having a median of 39’909 bp (Figure 2). While 40 Kb was the specified standard, shorter windows may result from gaps in the CW approach.
Figure 2
Distributions of (A) the number of SNPs per window and (B) the size of 862’400 windows creeping along chicken chromosomes GGA1 to GGA28.
In the SW scenarios we walked through the genome with non- or partly overlapping windows of size 40 Kb in steps of 0 to 20 Kb, respectively (data are shown only for the 20 Kb overlapping scenario). A panel of 40’289 windows in the partly overlapping scenario was created, which is explicitly a function of the extent of overlap between consecutive windows. Figure 3 illustratively compares the negative end of the ZH distribution for 61’538 creeping versus 2658 sliding windows, respectively, across chromosome GGA5.
Figure 3
The negative end of the ZH distribution from the creeping windows (CW) versus the sliding windows (SW) strategy is presented along GGA5.
The horizontal dashed line stands for the significance level at P≤0.001 (ZH = −3.70, genome-wide) and vertical gridlines help to compare similar signals between plots. Capital letters highlight positions where (A) CW produces more pronounced signals than SW; (B) SW misses signals found by CW; (C) SW produces spurious signals not confirmed by CW; and (D) CW finds classic long-range sweeps with typical patterns. A strong signal is found at the position of the TSHR gene already described by Rubin et al. (2010).
The negative end of the ZH distribution from the creeping windows (CW) versus the sliding windows (SW) strategy is presented along GGA5.
The horizontal dashed line stands for the significance level at P≤0.001 (ZH = −3.70, genome-wide) and vertical gridlines help to compare similar signals between plots. Capital letters highlight positions where (A) CW produces more pronounced signals than SW; (B) SW misses signals found by CW; (C) SW produces spurious signals not confirmed by CW; and (D) CW finds classic long-range sweeps with typical patterns. A strong signal is found at the position of the TSHR gene already described by Rubin et al. (2010).The comparison of the two profiles shows the following main discrepancies:The magnitudes of extreme signals obtained with the CW approach are higher than those obtained with the SW approach (e.g. at position A);The CW approach reports clear signatures of selection that are missed by the SW approach (e.g. at position B)The SW approach produces some spurious signals that are not confirmed by the CW analysis as these may be artifacts caused by gaps in the sequence (e.g. at position C)The CW approach identifies clear stretches of a selective sweep, with a typical gradient of decreasing ZH values from both sides, which is much less pronounced in the SW approach (e.g. at position D)These examples highlight the possibility that some selection signatures may have been missed or erroneously accounted for in previous studies based on SW approaches. In general, our results indicate an improved efficiency in signal detection for scanning genomes with the CW strategy. However, it must be noticed that intensified resolution sharply enlarges the number of windows, which affects the multiple testing issue.With the example of chromosome GGA5 (Figure 3) the ability to detect a selection signature using creeping windows of 40 kb is confirmed by localizing the previously described TSHR gene [17], along with two typical selective sweeps of different size depicting valleys of heterozygosity (D). One distinct sweep is observed at chromosomal position 19.3 to 21’5 Mbp harboring the APIP, PDHX, CD44, ACTC1, GPIAP1, NAT10, RAG1 and RAG2 genes. Another evident sweep is spanned over 1.27 to 1.72 Mbp overlapping the IGHMBP2, SYT12, Cor6, SIRT3, RIC8A, NADSYN1 and ZDHHC13 genes.
Revealing genome-wide significant signals
Faced with problems in determining the null distribution of a test statistic, researchers often focus on top-ranking SNPs and avoid specifying testable hypotheses. However, an outlier locus is not necessarily indicative of selection. In such an approach there are basically no a priori criteria available for deciding how extreme a region needs to be in order to claim a selection signal and the significance cutoffs are determined subjectively, rather than being derived from a model. Using permutation re-sampling in this study we derived a null distribution for testing the genome-wide significance under the null hypothesis of absence of selection (see Methods). Briefly, this permutation method maintains the original structure observed in the real data set such as the SNP density, and the background distribution of H values is computed after the frequencies of the SNPs are shuffled.Evidence of positive selection was investigated by assessing variation in allele frequency across the genome. In total, 862’400 windows were tested. The mean H value was estimated as 0.418±0.045 and the lowest H was 0.196 for a region on chromosome GGA1. Figure 4 compares the distribution of H values from the observed data against the profile of the lowest H values recorded in each permutation. As shown, the lower limit of H values obtained from 10’000 permuted datasets was 0.250 whereas the lowest H value from real data was 0.196. Accordingly, we placed the critical value for claiming candidate selective sweeps with an empirical genome-wide significance level P≤0.001 at H = 0. 252 (ZH = −3.70) and windows below this threshold were considered to represent selection signals. In total, 1816 putative windows, many of them overlapping, with a statistically significant (P≤0.001) departure from the norm of allelic variability were observed. This number exceeds the number obtained when we just accept the 0.1 per cent smallest (i.e. 862) values, as done in the usual outlier approach. However, with less stringent thresholds on the empirical p-values, for instance p≤0.01 (ZH<−3.50, genome-wide), only 3846 significant windows are obtained, which is considerably less than the 1% (8624) top-ranked signals.
Figure 4
Distribution patterns of the H profile from 862’400 windows creeping over the genome.
Pink and blue densities represent, respectively, the observed and the panel of recorded lowest H-values from 10’000 re-sampling runs in real data. Windows with H≤0.252 represent significant signals at the empirical error level P≤0.001. As indicated, 1816 windows characterize 82 selected regions with a more extreme local homozygosity than expected under neutrality.
Distribution patterns of the H profile from 862’400 windows creeping over the genome.
Pink and blue densities represent, respectively, the observed and the panel of recorded lowest H-values from 10’000 re-sampling runs in real data. Windows with H≤0.252 represent significant signals at the empirical error level P≤0.001. As indicated, 1816 windows characterize 82 selected regions with a more extreme local homozygosity than expected under neutrality.A striking feature that emerges when examining the distribution of allelic variability via the CW strategy is that diversity values tend to cluster together. This results in consistent signatures of selective sweeps for adaptive alleles, which in some cases extend over stretches of several megabasepairs. We considered these adjacent signatures as “distinct” if they typically exhibited the pattern of decaying H by distance to both sides (cf. position D in Figure 3).Across the genome, we counted signals of pooled heterozygosity (P≤0.001) that were accompanied by at least two consecutive significant windows (P≤0.05, genome-wide). In total, we observed 82 clusters representing strong evidence of selective sweeps. However, we believe that additional loci further down the list deserve closer examination in follow-up studies. The number of detected regions rose to 132 when the significance threshold of pooled heterozygosity was set to P≤0.01. Table S1 presents test statistics including the number of signals on each chromosome and positions for the full panel of regions that fell below H = 0.272 (P<0.01, genome-wide). The observation of multiple signals in a commercial layer line is consistent with the hypothesis that egg production is a complex trait controlled by many genes.In order to visualize the chromosomal distribution of significant signals, we plotted the ZH statistic against genomic position (Figure 5). Furthermore, a detailed graphical representation of the ZH signals for the 28 autosomes is reported in supporting information (Figures S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, S22, S23, S24, S25, S26, S27, S28). It is evident that the signals are non-uniformly distributed across chromosomes and chromosome segments.
Figure 5
The negative tail of the ZH distribution presented along GGA1 to GGA28.
Each dot represents a CW of 40 Kb and arrows point at the location of candidate genes (Table 1) and genes with reported associations in the literature. The horizontal dashed line indicates the significance threshold at P≤0.001 (Genome-wide ZH = −3.7).
The negative tail of the ZH distribution presented along GGA1 to GGA28.
Each dot represents a CW of 40 Kb and arrows point at the location of candidate genes (Table 1) and genes with reported associations in the literature. The horizontal dashed line indicates the significance threshold at P≤0.001 (Genome-wide ZH = −3.7).
Table 1
Summary statistics of the pooled heterozygosity metric for selection signature in candidate genes.
Gene
Chr
Position-bp
HP
P≤
Function/association
IGF1
1
57’327’750..57’376’178
0.24
0.001
Key regulator of muscle development and energy metabolism in birds.
[17], [26], [27], [28]
PRL
2
59’724’582..59’730’725
0.26
0.01
Egg laying pattern and production
[28]
TSHR
5
43’202’356..43’250’961
0.25
0.001
Inhibitory effect on Growth hormone secretion
[17]
PRLHR
6
31’242’680..31’243’785
0.24
0.001
Governing early embryonic axis formation
[29]
NRAMP1
7
24’283’380..24’363’380
0.21
0.001
Natural resistance to Salmonella infection and macrophage function
[30], [31]
LEPR
8
29’125’599..29’156’553
0.21
0.001
Affecting feed efficiency
[32]
INSR
28
3’431’232..3’471’081
0.27
0.01
Insulin signaling
[17]
Simulation
To evaluate the performance of our method for delimiting the significance of a selective sweep, we performed computer simulations. We considered models involving both neutrality (Neut) and a selective sweep (SP) at a single locus. The genomic distribution of SNPs and selective sweeps (i.e., one Sweep per 10 Mb) in the simulation scheme corresponds roughly to the chicken genome analyzed with the current SNP array. Two H sets with 23’265 and 22’955 windows were calculated in the Neut and the SP models, respectively. The mean Neut was estimated as 0.282±0.014 with a minimum of 0.234 which dropped to 0.224 in the sweep scenario. Figure 5. a, b respectively, depicts the profile of H values in the Neut and the SP simulations along with the location of the selective sweep in the middle of the simulated chromosome.We applied our test statistic to both simulated data sets. For this, allele frequencies from each data set were randomly shuffled across chromosomal positions and a profile of the smallest H values was generated from 1000 iterations. To compare the distribution of H profiles, we plotted the kernel density from both scenarios against their minimum profiles from permutations (Figure 6. c). As shown a perfect overlap is evident between both simulations and corresponding permutations except for H windows representing the selective sweep. The smallest H value from the neutral simulation is 0.234 which is distant from the minimum H value from permutations (0.226). Therefore, the test correctly assigns a non-significant p-value≤0.62 to the lowest heterozygosity window of the neutral simulation. On the other hand, in the selection scenario, the lower limit of SP = 0.224 was only exceeded in three permutation resamples with a minimum of H = 0.222. This signifies that the simulated sweep is conservatively detected at a significance level of p≤0.003 (SP = 0.224).
Figure 6
A graphical representation of simulation results from two genetic models.
Neut and SP abbreviate genetic models with neutral and selective sweep along with Neut.per and SP.per representing corresponding distribution of permutation. H profiles estimated from 23’265 and 22’955 creeping windows are plotted across a chromosome of 10 Mb in Neutral (a) and Sweep (b) models, respectively. A distinct valley of homozygosity at the middle of chromosome represents the simulated sweep. (c) Density distributions of H profiles form both models are depicted along with 1000 resamples.
A graphical representation of simulation results from two genetic models.
Neut and SP abbreviate genetic models with neutral and selective sweep along with Neut.per and SP.per representing corresponding distribution of permutation. H profiles estimated from 23’265 and 22’955 creeping windows are plotted across a chromosome of 10 Mb in Neutral (a) and Sweep (b) models, respectively. A distinct valley of homozygosity at the middle of chromosome represents the simulated sweep. (c) Density distributions of H profiles form both models are depicted along with 1000 resamples.
Validation with candidate genes
The thyroid-stimulating hormone receptor (TSHR) gene, a well-documented example of positive selection in the chicken [17] was used as a positive control to examine the validity of our approach. We extended the analysis to additional candidate genes known to be related with production traits and, therefore, being potentially under positive selection. For instance, the insulin-like growth factor 1 (IGF1) is known to be associated with growth, body composition, and skeleton integrity in chickens [26], [27]. Candidate genes were identified from the literature and databases including NRAMP1, PRL, PRLHR, INSR, LEPR and IGF1. We could not include GHR, PRLR and the BCDO2 gene causing yellow skin colour in our validation panel because they were either located on chromosome Z or the SNP coverage in the corresponding regions was not sufficient to effectively test the variability of these regions. The regions surrounding the genes displayed an elevated homozygosity compared to the genome-wide average. Table 1 presents the names, position and summary statistics for the chosen panel.The results revealed a significantly different H profile in most of the candidate regions. For example, a window perfectly overlapping the IGF1 gene on GGA1 represents the most extreme signal in the corresponding region. Consistent with Rubin et al. [17], we further observed locally reduced variation for a short region surrounding the TSHR gene on GGA5 (Figure 3). In contrast, some regions contained several consecutive windows with consistently low H values. For instance, the Leptin receptor gene (LEPR), a candidate gene with a central role for Leptin signaling affecting feed efficiency, displays statistically significant low-H windows extended over several Mb (Figure S8), possibly indicating that this locus has been subject to recent selective pressures. In addition to the age of selection, several factors may affect the size of a selective sweep, like the local recombination rate, whether the selected variant ever reached complete fixation, the number of generations it took before fixation and any population admixture at a time point after the sweep initially occurred.
Functional annotation of regions under selection
We annotated the genomic regions harboring significant signals using the map viewer program, and by aligning the positions to the second draft of the chicken genome sequence assembly, to reveal genes and ESTs located in the respective region. Table 2 summarizes statistics for a collection of selected regions across the genome harboring the strongest signals along with the distinct sweeps. The window with the smallest H value (H = 0.196, P<0.001) was observed on GGA1 embedded within 130’539’515 to 130’579’189 bp. This is a poor gene content region with no coding sequence mapped. The region, however, depicts the pattern of a distinct sweep spanning over 2 Mb (Figure S1). We extended the window to its decaying domains in both directions up to 700 kb to find the biologically most interesting candidate gene in this region. Of the 8 ESTs in this region, haloacid dehalogenase-like hydrolase domain containing 1A, was the only gene in the region. HDHD1 is a conserved gene in many species and very little is known about its biological importance. Another strong signature of selection on GGA2 (H = 0.203, P<0.001) matched the Calbindin 1 gene. CALB1 is a 28,000-kDa calcium-binding protein, which fluctuates in a circadian fashion during the daily egg cycle, in close temporal association with eggshell calcification [33], [34]. It was shown that the pattern of CALB1 expression is related to eggshell quality [33] and eggshell abnormalities in layer chickens [35]. Association was also demonstrated between CALB1 gene expression and reduction of eggshell thickness after xenoestrogen treatment [36]. Moreover, on chromosome 4, a region harboring the secreted phosphoprotein 1 or Osteopontin gene showed a signal of positive selection (P-value<0.01). It was suggested that SPP1 could be involved in the mechanism controlling the arrest of eggshell calcification [37] and the specific occlusion of SPP1 into calcite during mineralization may influence eggshell structure and thereby modify its fracture resistance [38]. There are also reports that polymorphisms within the Osteopontin gene are associated with 5-week body weight in egg laying chickens [39]. Further to the strong signal overlaying the Nramp1/SLC11A1 gene, which is a well documented candidate for immune traits in chickens, a distinct sweep (P<0.001) was detected on chromosome 3 embedding the gene cluster Gallinacin 1–13. This cluster is designated densely within a 86-Kb distance and encodes Avian beta-defensins, a family of antimicrobial peptides that are capable of killing a broad spectrum of pathogens and play a critical role in innate immunity in chickens [40]. Beta-defensins are also present in different compartments (eggshell, egg white, and vitelline membranes) of the egg and are expected to be involved in the protection of the embryo during its development and to contribute to the production of pathogen-free eggs [41].
Table 2
Collected panel of genomic regions identified as candidate selective sweeps.
Positions in normal format represent “distinct sweeps” revealed by the H metric. A distinct sweep spans over numerous consecutive significant windows and depicts a typical valley of heterozygosity.
Signals overlapping genes with a previously described association.
Positions in normal format represent “distinct sweeps” revealed by the H metric. A distinct sweep spans over numerous consecutive significant windows and depicts a typical valley of heterozygosity.Signals overlapping genes with a previously described association.Some of the regions identified contain genes with biological functions that were previously discussed in connection to traits under selection in other species. For example, strong evidence of a sweep reflected by a set of windows on GGA4 (P<0.001) involves the bone morphogenetic protein receptor, type IB gene (BMPR1B) which is a major determinant of ovulation rate and litter size in sheep [42], [43]. A candidate gene affecting growth traits and with a central role in regulating IGF gene, insulin-like growth factor binding protein (IGFBP), also lies within a distinct sweep region on GGA4 (Table 2). We also found several other regions harboring genes with biological functions that could be related to (production) traits. In general the annotation list (Table S1) is enriched with genes of biological interest involved in carbohydrate metabolism pathways, muscle-skeletal structure development, solute carrier proteins, calcium signaling pathways and the immune system.The first genome-wide scan of selection for local homozygosity in the chicken was performed by ICPMC [16] using sequence data from only 3 individuals representing layers, broilers and the Red Jungle fowl, respectively. In a more comprehensive study, Rubin et al. [17] re-sequenced pooled DNA from a number of commercial and domestic lines to identify selective sweeps of favorable alleles. Local heterozygosity was calculated in sliding windows of 40 Kb, and seven putative selective sweeps were detected in layers at 6 standard deviations away from the genome mean. In addition to the aforementioned candidate genes TSHR, INSR and IGF1, two out of seven regions overlapped with regions revealed in the present study. Identification of these regions in two independent studies supports the hypothesis that these regions have strong signatures of selection and are likely to be true positives.There are, however, several regions with strong evidence of selection identified in our study that were not reported previously. Apart from genetic drift, the differences may result in part either from the insufficient power of the tests employed or from insufficient coverage in the datasets scanned. The SNP calling depth in the current study was at least four times larger than the one in earlier studies, which provides more reliable allele frequency estimates. As demonstrated above, the scanning resolution in the CW approach is much better than the one obtained with SW, which raises the possibility that some signals may have been missed or falsely reported in previous studies (cf. Fig. 3). The inconsistencies can also originate from the lack of a consensus threshold in empirical approaches. Earlier studies just reported a fraction or the most extreme results (i.e. the 1% or 0.1% outliers in the empirical distribution), while in our study a permutation-based genome-wide significance threshold was applied. Combining this conservative testing strategy with the identification of candidate regions (characterized by a series of significant windows) yielded a relatively low number (132) of significant regions for selective sweeps (listed in Table S1) albeit of high credibility. Finally, there are signals that probably do not reflect historic selection at all, but rather arise from local genomic differences in mutation or recombination rates, or are statistical outliers in multiple genome-wide tests for significance.
Conclusions
We adapted a permutation-based re-sampling method as a valid approach to test the significance of differences in local variability. The method uses the original allele frequency spectrum of the genome under study to maintain the observed SNP structure for defining an empirical p-value. However, it assumes a uniform demography across the genome and generates the null distribution based on independence of allele frequency estimates between neighboring SNPs which is violated in a real scenario. We realize the permutation approach to testing for significance is very straightforward, and it may be argued that more sophisticated methods could generate a null distribution by performing neutral simulations with a range of demographic and recombination effects. However this bears its own challenges in defining the models appropriately such that they reproduce the full SNP structure in the data set, and even then we are not certain it would yield greater sensitivity or specificity in detecting sweeps. We also improved the resolution of signal detection using a creeping window strategy. Genome-wide, 82 regions with strong evidence of selection (P-value<0.001) were identified including genes known to be associated with eggshell quality and immune system, such as CALB1 and the GAL cluster. Our results confirm the presence of selective sweeps in regions of previously described candidate genes, in some cases spanning over intervals of several megabases. The observation of multiple signals is consistent with the hypothesis that egg production is a complex trait controlled by many genes. The major challenge remains to distinguish true signals from those due to genetic drift. One possible solution involves analyzing separate populations with different phylogenetic history, but selected for similar breeding goals (e.g. white-egg layers and brown-egg layers), hypothesizing that true signals generated by selection would overlap across the populations. Such efforts are currently underway by the authors, along with validation of results obtained with other methods of selection signature detection. Further research should also try to verify hypothesized relationships between gene networks rather than single genes underlying the observed pattern of selection signatures. Our results may be of future interest for identifying signatures of artificial selection in commercial chicken breeds or as additional evidence for any polymorphism that shows associations with egg production traits.
Materials and Methods
Ethics statement
Samples were collected by veterinarians in the Lohmann company in the course of a routine health check for diagnostic reasons and a partition of these samples was used to extract DNA. The authors collected no samples themselves.
Whole genome re-sequencing and SNP discovery
We studied a commercial brown layer line provided by Lohmann Tierzucht GmbH. Blood samples were collected with EDTA as anticoagulant from the wing vein of 15 unrelated female birds originating from different sire families. DNA was extracted from blood samples following a standard Phenol/Chloroform extraction protocol [44]. DNA quality and concentration of each sample was calculated and equal amounts of DNA of 15 samples were mixed to produce the DNA pool for sequencing.Sequencing libraries were constructed with paired-end DNA sample preparation kits (Illumina) according to the manufacture's recommendations. Sequencing was carried out on an Illumina Genome Analyzer IIx as 76 bp paired-end reads. We sequenced two lanes of a flow cell yielding 22.0 Gb reads. Image analysis and base calling was performed using the Genome Analyzer Pipeline software.Sequence reads were mapped against the third build of chicken reference genome (yet to appear officially in the public databases). Prior to mapping, the reference genome was repeat masked using RepeatMasker. To further remove potentially problematic areas of the genome, 16 mers occurring more than 5 times were also masked. Reads were aligned to the reference genome using BWA version 0.5.7 with default parameters. Samtools version 0.1.7 [45] was used to remove potential PCR duplicates and to call SNPs. About 112.0 million reads aligned to the genome with a mapping quality score of 20 or more. A SNP was called when the position was covered by at least 5 reads with a mapping quality score of 20 or over, and a base quality score of 20 or over.
Quality control and data filtering
The total number of SNPs detected in the pool was 4’540’269. We checked the markers for redundant positions and applied a number of rules to edit SNPs for further analysis. To minimize incorporating false SNP, we used Phred scaled SNP quality score, Q, which is related to the SNP calling error probability (p) by the equation: Q = −10×log10(p). The average SNP quality score was estimated as 100±50. We kept polymorphisms with a minimum score of 20 (99% accuracy) as an acceptable error rate (Figure S29 in the Online Data Supplement).Polymorphisms detected had a read depth between 1 and 20’895. To reduce potential errors in SNP frequency estimates in the pool of 15 individuals, and to preclude over-representation of repetitive sequences, we only used polymorphisms with a read-depth between 15× and 50×. In the final data set the average read depth was 21.9±5.0. Figure S30 displays the distribution of the depth of SNP calling in the final data set analyzed.Analysis of the inter-marker distance between polymorphisms revealed numerous genomic gaps (regions free of SNPs) on some microchromosomes and chromosome Z of up to 5 Mb or larger. Therefore, only autosomes GGA1-GGA28 were included in the final analysis. In the filtered data the average inter-marker space was estimated as 314.2±1362.1 bp (median = 97 bp), and 5503 gaps were present across the genome. Figure S31 presents a genome wide image of marker distribution in the original SNP panel. The accumulated proportion of genomic gaps was estimated as 13.3% of the genome after filtering.In total 2’913’540 SNPs on 28 autosomes were included in the final analysis. Average minor allele frequency (MAF) was 0.31±0.11, and only 16’135 markers (0.5%) had a minor allele frequency of less than 10% (Fig. S32). The pattern of MAF distribution was fairly similar to those from already available commercial 37 K and 60 K Illumina bead chips.
Detecting selective sweeps
To identify genomic regions that may have been targets of past selection, we used the pooled heterozygosity (H) statistic suggested by Rubin et al. [17]. For a window with loci,where is the number of reads at locus and is the number of reads of the most abundant allele at locus . H values were z-transformed to ZH values with mean = 0 and SD = 1 to facilitate visualization of the outlying signals and comparison with previous reports.
Sliding and creeping window approach
To facilitate comparisons of genomic regions with a higher resolution we adopted a more expedient approach called “creeping window” to scan the entire genome for evidence of selective sweeps (Figure 7). This is an intensified “sliding window” strategy that moves windows in steps of only one SNP forward and, while passing over genomic gaps <10 Kb, it skips gaps >10 K and re-starts from the first SNP after a gap. We acknowledge that specifying 40 kb as window size was subjective, but it was motivated from previous studies and by the desire of having a sufficiently large number of SNPs in a window. According to Rubin et al. [17] spurious fixation signals are more likely to occur when few chromosomes are sampled from a DNA pool and inadequate numbers of polymorphic loci in windows are analyzed. Thus to avoid noise in estimates of non-uniform windows we removed windows <30 K and those with less than 10 SNPs for further analyses.
Figure 7
A graphical comparison of two genome scanning strategies.
sliding windows (SW) vs., creeping windows (CW). With SW a chromosome is split into non (or partly) overlapping windows of 40 K and while passing over genomic gaps, it may not perfectly overlie a selective sweep. CW implements an elevated resolution moving windows in steps of only one SNP forward. The approach bridges small (<10 K) gaps while it stops at larger gaps and re-starts at the opposite side. CW always centers a window relative to a sweep position.
A graphical comparison of two genome scanning strategies.
sliding windows (SW) vs., creeping windows (CW). With SW a chromosome is split into non (or partly) overlapping windows of 40 K and while passing over genomic gaps, it may not perfectly overlie a selective sweep. CW implements an elevated resolution moving windows in steps of only one SNP forward. The approach bridges small (<10 K) gaps while it stops at larger gaps and re-starts at the opposite side. CW always centers a window relative to a sweep position.
Assessing statistical significance
We followed the ideas of Churchill and Doerge [46] in applying a permutation approach to define empirical significance thresholds for any individual window. For this, the SNP positions are taken as fixed and allele frequencies are randomly shuffled across positions in each iteration. This is followed by computing pooled heterozygosities for creeping windows of 40 K from the shuffled data and the genome wide lowest H from a window ≥30 K formed by ≥10 SNPs is stored. After repeating this procedure for n iterations, the empirical threshold pertaining to error probability P = 0.001 is the value cutting of the 0.001 quantile in the ordered vector of minima. We ran the simulation with n = 10’000 iterations, computing H values for 862’400 creeping windows in each iteration. This approach conserves the genome structure, like SNP densities and gap positions, and allows simulated data to be randomly drawn from the allele frequency distribution of the population under study. Hence, we do not assume any particular population genetic model to generate the background allele frequency spectrum, but the expected background pattern of variability is given by the data.Program MSMS [47] was used to simulate genomic samples under a coalescent model with mutation, recombination, and constant population size. In the simulations, we assumed two different scenarios: one is the reference population under neutral conditions, and the other is the test population with a single site under positive selection without recurrent mutations. In each model 100 replications of a chromosome of length = 10 Mbp and sample size = 30 was simulated with a selective sweep evolving at the middle of chromosome for the selection model. The list of simulation parameters used is presented in Table 3. We later simulated pooled NGS data from the genomic samples obtained from MSMS by random sampling of 20 chromosomes in each site independently which is an explicit approximation to the average calling depth in the real data set. Allele frequencies from these data sets were then used to estimate the profile of heterozygosity over creeping windows in each single sample and averaged over the number of replications.
Table 3
Parameters for the MSMS simulations.
Parameter
Value
Sequence length
l
10’000’000 bp
Sample size
n
30
Population scaled mutation rate (per site)
θ
10−8
Population scaled recombination rate (per site)
ρ
10−8
Effective population size
Ne
10,000
Number of SNPs
50’000 bp
Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Chromosome wide distribution of variability measured in overlapping windows of 40 k.(PDF)Click here for additional data file.Phred quality score distribution.(PDF)Click here for additional data file.Distribution of the calling read of SNPs in final data set.(PDF)Click here for additional data file.A genome wide inter marker distance between neighboring markers before data cleaning.(PDF)Click here for additional data file.Frequency distribution of minor allele frequencies involved in final analysis.(PDF)Click here for additional data file.The list of genomic regions likely to be under selection (
<0.01, genome-wide).(DOC)Click here for additional data file.
Authors: Rasmus Nielsen; Scott Williamson; Yuseob Kim; Melissa J Hubisz; Andrew G Clark; Carlos Bustamante Journal: Genome Res Date: 2005-11 Impact factor: 9.043
Authors: Scott H Williamson; Melissa J Hubisz; Andrew G Clark; Bret A Payseur; Carlos D Bustamante; Rasmus Nielsen Journal: PLoS Genet Date: 2007-04-20 Impact factor: 5.917
Authors: Timothy M Beissinger; Candice N Hirsch; Brieanne Vaillancourt; Shweta Deshpande; Kerrie Barry; C Robin Buell; Shawn M Kaeppler; Daniel Gianola; Natalia de Leon Journal: Genetics Date: 2013-12-30 Impact factor: 4.562
Authors: Lingyang Xu; Derek M Bickhart; John B Cole; Steven G Schroeder; Jiuzhou Song; Curtis P Van Tassell; Tad S Sonstegard; George E Liu Journal: Mol Biol Evol Date: 2014-11-26 Impact factor: 16.240
Authors: Brock A Harpur; Clement F Kent; Daria Molodtsova; Jonathan M D Lebon; Abdulaziz S Alqarni; Ayman A Owayss; Amro Zayed Journal: Proc Natl Acad Sci U S A Date: 2014-01-31 Impact factor: 11.205
Authors: Timothy M Beissinger; Guilherme J M Rosa; Shawn M Kaeppler; Daniel Gianola; Natalia de Leon Journal: Genet Sel Evol Date: 2015-04-17 Impact factor: 4.297
Authors: Almas A Gheyas; Clarissa Boschiero; Lel Eory; Hannah Ralph; Richard Kuo; John A Woolliams; David W Burt Journal: DNA Res Date: 2015-04-29 Impact factor: 4.458
Authors: Seth N Redmond; Karin Eiglmeier; Christian Mitri; Kyriacos Markianos; Wamdaogo M Guelbeogo; Awa Gneme; Alison T Isaacs; Boubacar Coulibaly; Emma Brito-Fravallo; Gareth Maslen; Daniel Mead; Oumou Niare; Sekou F Traore; N'Fale Sagnon; Dominic Kwiatkowski; Michelle M Riehle; Kenneth D Vernick Journal: BMC Genomics Date: 2015-10-13 Impact factor: 3.969