Literature DB >> 31744455

Mapping genomic regions affecting milk traits in Sarda sheep by using the OvineSNP50 Beadchip and principal components to perform combined linkage and linkage disequilibrium analysis.

Mario Graziano Usai1, Sara Casu2, Tiziana Sechi1, Sotero L Salaris1, Sabrina Miari1, Stefania Sechi1, Patrizia Carta1, Antonello Carta1.   

Abstract

BACKGROUND: The detection of regions that affect quantitative traits (QTL), to implement selection assisted by molecular information, remains of particular interest in dairy sheep for which genetic gain is constrained by the high costs of large-scale phenotype and pedigree recording. QTL detection based on the combination of linkage disequilibrium and linkage analysis (LDLA) is the most suitable approach in family-structured populations. The main issue in performing LDLA mapping is the handling of the identity-by-descent (IBD) probability matrix. Here, we propose the use of principal component analysis (PCA) to perform LDLA mapping for milk traits in Sarda dairy sheep.
METHODS: A resource population of 3731 ewes belonging to 161 sire families and genotyped with the OvineSNP50 Beadchip was used to map genomic regions that affect five milk traits. The paternally and maternally inherited gametes of genotyped individuals were reconstructed and IBD probabilities between them were defined both at each SNP position and at the genome level. A QTL detection model fitting fixed effects of principal components that summarize IBD probabilities was tested at each SNP position. Genome-wide (GW) significance thresholds were determined by within-trait permutations.
RESULTS: PCA resulted in substantial dimensionality reduction, in fact 137 and 32 (on average) principal components were able to capture 99% of the IBD variation at the locus and genome levels, respectively. Overall, 2563 positions exceeded the 0.05 GW significance threshold for at least one trait, which clustered into 75 QTL regions most of which affected more than one trait. The strongest signal was obtained for protein content on Ovis aries (OAR) chromosome 6 and overlapped with the region that harbours the casein gene cluster. Additional interesting positions were identified on OAR4 for fat content and on OAR11 for the three yield traits.
CONCLUSIONS: PCA is a good strategy to summarize IBD probabilities. A large number of regions associated to milk traits were identified. The outputs provided by the proposed method are useful for the selection of candidate genes, which need to be further investigated to identify causative mutations or markers in strong LD with them for application in selection programs assisted by molecular information.

Entities:  

Mesh:

Year:  2019        PMID: 31744455      PMCID: PMC6862840          DOI: 10.1186/s12711-019-0508-0

Source DB:  PubMed          Journal:  Genet Sel Evol        ISSN: 0999-193X            Impact factor:   4.297


Background

The identification of genomic regions that affect traits of interest and the application of marker- or gene-assisted selection [1] in livestock are crucial to speed up genetic improvement. This is especially valid in species for which genetic gain is hampered by the relatively high costs of large-scale phenotyping and the logistic constraints of artificial insemination [2]. However, application of selection assisted by molecular information for traits that are influenced by numerous loci, each one explaining a small portion of the trait variance, is limited by the lack of power of experiments based on low-density marker maps [3]. In sheep, attempts to identify regions that affect quantitative traits (QTL) were performed first by using microsatellite maps [4-13]. Most of the identified QTL showed low significance levels and large confidence intervals, and thus their use in selection programs was not possible [2]. Nevertheless, the discovery of thousands of single-nucleotide markers (SNPs) and cost-effective tools (SNP arrays) to genotype them on a large number of animals as well as the recent availability of affordable whole-genome sequencing techniques, has opened new perspectives. It is expected that polymorphisms with small effects that collectively explain an increasing amount of the genetic variance may be gradually identified by using denser molecular marker maps on larger structured resource populations [14]. Thus, individuals that belong to pre-existing and new dairy sheep experimental populations have been accurately recorded for several traits and genotyped with 50 K and/or 600 K SNP chips. This is the case of an experimental flock of Sarda ewes, which has been set up since 2000 as a resource population to implement selection assisted by molecular information in the breeding scheme of this Italian dairy breed. QTL detection studies based on SNP arrays and the availability of increasingly accurate genome asec15 nnotation data allow the listing of potential candidate genes [15-20]. Moreover, whole-genome re-sequencing of target animals has been used to restrict the number of candidate polymorphisms. Thus, putative causative mutations that affect traits of economic interest have been proposed [21-24]. Among the available QTL detection approaches, those based on the combined use of linkage disequilibrium (LD) and linkage analysis (LA) information (LDLA) have been indicated as the most powerful, robust and precise in populations that are structured in families [16, 25, 26]. The main reason is that they account for both recombination events that occurred within genotyped generations and historical recombination events that occurred in generations prior to genotyping [25]. Several approaches have been proposed to combine LD and LA information [25, 27–31]. The classical LDLA method [25, 32] performs a variance component analysis at each putative QTL position by using identity-by-descent (IBD) probabilities between haplotypes. The main issue is that the IBD probability matrix is often dense, non-positive definite and computationally demanding for its inversion [29, 33]. Thus, either strategies that perform the hierarchical clustering of haplotypes based on IBD probability [33-35] or the approximation of the IBD based on the extent of the identity-by-state status between haplotypes [36] have been used. However, these approximations inevitably result in a loss of information [29]. An alternative way to process IBD information is principal component analysis (PCA), which has the desirable feature of collapsing information that is contained in a set of correlated variables by a smaller set of orthogonal variables. As such, PCA has been proposed as a technique to reduce the dimensionality of predictors in genomic selection [37, 38]. The aim of this study was to detect genomic regions that affect milk traits in the resource population of Sarda sheep by applying an LDLA approach combined with PCA to overcome computational issues of the IBD matrix and minimize the loss of information.

Methods

Resource population

The generation of the resource population (RP) started in 1999 when 10 Lacaune × Sarda F1 sires were mated to Sarda ewes to produce 928 back-cross female lambs in the framework of an European project aimed at detecting QTL in the main European sheep breeds (QLK5-CT-2000-00656; “genesheepsafety”). Subsequently, we focused on the detection of QTL segregating in the pure Sarda breed, and since 2002 we used exclusively Sarda rams (SA) to produce the yearly replacement of RP. Until 2009, the average size of the sire families was 43 daughters whereas, from 2010 onward the average size of families decreased to nine daughters, in order to increase the number of Sarda bloodlines represented in the RP. Sarda sires were always chosen based on their genetic impact on the registered population among rams belonging to the artificial insemination centre of the breed. In total, 3949 ewes from 161 rams (10 F1 and 151 SA) were generated until 2015. Ewes of RP were kept until the 4th (occasionally the 5th) lactation on an experimental farm. The farming system was similar to that commonly applied in Sardinia with most of the adult ewes lambing in autumn and yearlings lambing between January and March. The ewes were milked twice a day by machine from weaning (3–4 weeks from lambing) until the end of July. The feeding regime was based on controlled grazing, supplemented by hay and concentrates in winter and late spring.

Genotypes and phenotypes

All the ewes of RP and their sires as well as the 10 Lacaune sires of F1 and 11 Sarda sires of SA were genotyped with the Illumina Inc. OvineSNP50 Beadchip. SNP editing was performed using call rate and MAF thresholds of 95% and 1%, respectively. The ovine genome assembly v4.0 and the software SNPchimMpv.3 [39] were used to construct the genetic map by assuming 1 Mb = 1 cM. Unmapped SNPs and SNPs on sex chromosomes were not included in the study. A large range of traits of economic interest was measured in the RP. In the current study, we focused on milk traits: milk yield (MY); fat yield (FY); protein yield (PY); fat content (FP) and protein content (PP). MY, FP and PP were measured twice a month during the milking period at the am and pm milking. Lactation records were computed by the Fleischmann method, using records from the milking period only (in agreement with ICAR recommendations), by considering an initial suckling period of 30 days. Finally, 13,059 lactations of 3731 ewes recorded from 2000 to 2017 were retained. The average number of records per ewe was 3.5 ± 1.02, ranging from 1 (5% of animals) to 5 (9.3% of animals); most of the ewes (55.5%) had four records. First, in order to adjust for the main environmental effects, raw lactation records were analysed with single-trait repeatability animal models using the ASReml 4.1 software [40]. Genetic relationships between animals were taken into account by calculating the genomic relationship matrix [41] between 4513 animals, including F1 and SA sires and their genotyped ancestors. The animal model included as fixed effects the year-management-group interaction (37 levels), the year-month of lambing-parity-age class interaction (230 levels) and the milking length within age class (adult and primiparous) as a covariate. The average performance deviation (APD) of each ewe was calculated as the average of lactation records adjusted for fixed effects. The APD used in this study as pseudo-phenotypes for QTL detection differ from the yield deviations [42] that were used in similar studies in that the performances are not adjusted for permanent environmental effects in order to prevent inaccurate estimations due to the likely confounding between permanent environment and additive genetic effects. Indeed, Pearson’s correlations between additive genetic and permanent environmental effects from the repeatability animal model ranged, for the five analyzed traits, from 0.46 for FY to 0.50 for PP. Moreover, although in this study we shall investigate only additive effects, ADP include dominance and epistatic genetic effects when they exist and are suitable pseudo-phenotypes for testing such effects in further analyses. Finally, 3731 APD from as many ewes were available. To verify the suitability of the applied animal model, the ratio between the estimated genomic and total variance was compared with the heritabilities reported in the literature. In the same way, the correlations between APD and GEBV of different traits were compared with phenotypic and genetic correlations estimated in other studies.

Classification of gametes and reconstruction of gametic phases

By “gamete”, we refer to the whole haploid set of autosomes that are inherited by an individual from one of the two parents. Moreover, we classified the gametes of the population as base haplotypes (BH) when inherited from an ungenotyped parent and replicated haplotypes (RH) when inherited from a genotyped parent. The pool of BH comprised both gametes of F1 rams and of 63 SA rams, the maternal or paternal gametes of the 35 ewes with an unknown sire or dam, respectively, and the maternal gametes of the 928 back-cross ewes and of 85 SA rams for which the sire was genotyped. Finally, 1207 gametes were classified as BH, i.e. the 10 F1 sires paternal Lacaune gametes () and 1197 Sarda origin gametes (). Then, all 7462 gametes () carried by the 3731 ewes with production records () were considered as replicates (RH) of the 1207 BH (). An example of how gametes were classified is given in Appendix. The paternal and maternal inherited gametes of all the genotyped individuals were reconstructed by using a procedure based on the linkage disequilibrium multilocus iterative peeling method proposed by Meuwissen and Goddard [43]. In this method, the parental origin of the alleles carried by an individual is iteratively inferred on the genotypes of parents and offspring at a given locus if they are already phased or at the neighbouring phased loci if the phase at the given locus is unknown. Here, the LD at the population level was ignored, since the population structure was expected to allow a high level of precision using family relationships only. For individuals with both parents without a genotype, the paternal or maternal origin was arbitrary assigned. Genotypes for which the parental origin of alleles was assigned with a probability lower than 0.99 were assumed missing.

Calculation of IBD probabilities

The marked familial structure of the RP led us to exploit the information from the within-family linkage analysis (LA) in addition to that from the population-wide linkage disequilibrium (LD) to estimate IBD probabilities. First, IBD probabilities between BH and RH were calculated by LA () given the known gametic phases and the pedigree information. The grand-parental origin of each RH was estimated at each SNP position with certainty when the genotype at a given position was not missing and the parent transmitting RH was heterozygous. When these conditions were not fulfilled, the probability of a grand-parental origin at a given locus was determined based on information from the closest neighbouring informative loci [44, 45]. Then, transmission from BH to RH was traced through generations following Fernando and Grossman [46]. At each SNP position , probabilities were stored in a matrix with size . Moreover, the number of replicates of a given BH in RP at each SNP position () was calculated as , where is a vector of ones (see Appendix). Secondly, IBD between BH were estimated by LD analysis () at each SNP position following Meuwissen and Goddard [47]. The probability was conditioned to the identity-by-state (IBS) status of neighbouring SNPs using windows of 21 SNPs (10 upstream and 10 downstream resulting in an average window length of 1 Mb) and to the within-breed (Lacaune or Sarda) expected homozygosity. The between and were assumed to be null. At each SNP position , probabilities were stored in a matrix with size . Once we had precisely estimated the covariances between BH and between BH and RH as well as the BH number of replicates at each locus, the average values across loci were also calculated and stored in (), () and (), respectively. These matrices will be used later to estimate genome-wide IBD probabilities between gametes to adjust for the polygenic effect of background genes.

Principal component analysis and QTL detection model

Hereafter we describe a novel LDLA approach for QTL mapping that, similarly to the basic LDLA model proposed by Meuwissen et al. [25], relies on the IBD information at the locus level and takes the effect of background genes into account. In their study, Meuwissen et al. [25] modelled the phenotypic records by the random effects of all the inherited gametes (, twice the number of individuals) at a given position and the random polygenic effects (, i.e. the combined effect of background genes) of all the individuals. In the original model [25]: , the covariance matrix of gametic effects included IBD probabilities between founder gametes that were obtained by LD analysis () [47] and IBD probabilities between founder and non-founder gametes and between non-founders gametes that were obtained by combining the corresponding transmission probabilities () with , using the algorithm described by Fernando and Grossman [46]. The additive polygenic covariance between individuals was considered through the numerator relationship matrix based on pedigree information. In the basic method [25], the maximum likelihood estimates of the variance components were calculated at each putative QTL position . The application of this method implies some relevant issues related to the nature of the matrix, which is usually dense and may turn out to be non-positive definite, and the computational needs in applying the variance component analysis at each investigated position. To overcome these issues, we propose a novel approach which uses the principal component analysis (PCA) to handle and exploits the dimensional reduction of the model equations that may be achieved by PCA to estimate both QTL and polygenic effects. First, PCA is used to capture the IBD information at the locus level with the aim of overcoming the difficulties in inverting the matrix, in a different way from previous strategies having the same purpose [33–36, 48] which frequently result in a loss of information [29]. Second, PCA is applied to the matrix of the genome- wide IBD probabilities between gametes (i.e. the average across loci of IBD probabilities locus-wide) which is used instead of the classical numerator relationship matrix () to take into account the polygenic effects. A similar approach was used by Rothammer et al. [49, 50], which applied PCA to reduce the dimension of the relationship matrix and used explanatory PC as fixed effects in their QTL detection model. Third, PC that explain most of the variability of both the locus-level and genome-wide IBD probability matrices are included in the model as fixed effects to estimate both QTL and polygenic effects by performing a least square analysis instead of the more computationally demanding variance component one. Thus, at each SNP position the model is the following:where is a vector of APD of ewes for MY, PY, FY, PP and FP; is the overall mean; is a vector of the fixed effects of the principal components that explain more than 99% of the within breed variation () of the IBD probability matrix , i.e. summarizes the effects of haplotypes at the QTL position ; is a vector of the fixed effects of the principal components that explain more than 99% of the variation () of the genome-wide IBD probability matrix, i.e. summarizes the polygenic effects of the gametes; is a vector of ones; is a incidence matrix relating phenotypes with RH; is a matrix including the scores of RH, is a matrix including the scores of RH; is a vector of residuals assuming that with a diagonal matrix with the APD’s reliability () as diagonal element. For each investigated trait (MY, PY, FY, PP and FP), reliabilities were calculated as , from a repeatability linear model , where is the performance deviation (i.e. the lactation record adjusted for the fixed effects estimated with the full animal model) of ewe , is the random ewe effect assuming that and is the corresponding error, assuming that . Below, we explain how the PC scores of the and matrices were calculated. In addition, a numerical example is given in Appendix. As far as the elements are concerned, in order to limit the computational requirements to extract PC directly from the large () matrix, the PCA was performed on a matrix denoted as , where the probabilities between BH, stored in , were weighted for the probabilities between BH and RH by condensing information in a diagonal matrix , in which the diagonal elements are the number of RH of each BH (stored in , where ). The matrix is defined as:PCA was carried out on by using the Jacobi algorithm. Eigenvectors () relating to the largest principal components that together explain more than 99% of the within-breed variation () were retained. Finally, probabilities between BH and RH () were combined with to define the scores of RH () by: Note that when between BH and RH are estimated with certainty and, thus, only contains 0 and 1, then: and ; eigenvalues from correspond to eigenvalues from for the explanatory principal components () and scores from correspond to (see Appendix). When between BH and RH are estimated without certainty and contains intermediate values between 0 and 1, scores from and do not correspond perfectly and differences tend to increase as the uncertainty of probabilities increases. This is because only considers the covariance that derives from excluding the covariance between RH pairs generated by imprecise estimation of transmissions of BH. This effect is negligible in our experiment because most transmission probabilities are estimated with certainty. Since the between and was set to 0 and two sets of breed-specific PC were obtained, the matrix can be detailed as . where and are the summarising IBD probabilities between the gametes of Sarda and Lacaune origin, respectively. In Eq. (1) elements, which are related by the incidence matrix to phenotypes, are used as covariates on the investigated traits to estimate QTL effects at locus (). As far as the elements are concerned, the PCA performed directly on the weighted genome-wide matrix, () computed as in Eq. (2) resulted in 1022 PC that were needed to capture 99% of the total variation. This limited dimensional reduction is due to the moderate genome-wide . probabilities between BH (on average around 0.1) and the small number of replicates of some BH on RP. Thus, in order to not over-parameterize the model, we considered the BH with the highest impact on RP (). Then, to recover information from BH with few RH, a matrix of coefficients relating to all the BH was calculated as:where is the section of including probabilities between all the BH with and is the inverse of the section of including probabilities between pairs. The average number of replicates per was then updated as . The set was iteratively selected as the smallest group of BH that satisfied the condition that is higher than 0.99. According to the analysis at the locus level, the Jacobi algorithm was performed on the matrix computed as:where is a diagonal matrix, with its diagonal elements being the number of replicates stored in . Eigenvectors () of the largest principal components that together explain more than 99% of the total variation of () were retained. Finally, genome-wide probabilities between BH and RH () were combined with to define scores of RH () by: In Eq. (1), , scores which are related by the incidence matrix to phenotypes, are used as covariates on the investigated traits to estimate polygenic effects (). Note that and vectors in Eq. (1) are both fixed effects that are depicted separately to highlight that the model aims at estimating QTL effects () while adjusting for the background of polygenes (). Moreover, covariates related to are locus-specific whereas covariates related to remain constant throughout the genome. In addition, in accordance with , can be detailed as , where and summarize the effects of the Sarda and Lacaune gametes, respectively. The model was tested at each SNP position by F-tests. Three null hypotheses were tested, :  = 0; :  = 0 and :  = 0. In the current study, only tests, corresponding to :  = 0, will be analysed and discussed. Genome-wide (GW) significance thresholds were determined by 2000 within-trait permutations of the residuals () of the reduce model , where only the polygenic effects were considered. In order to break free from differences in the number of degrees of freedom at each SNP position, genome-wide maxima of the negative logarithms of p-values from each permutation were used to construct the null distribution.

Results

Production data and phenotypes for QTL detection

The ratios between genomic and total variance estimated by a single-trait animal model (Table 1) are consistent with the estimates of heritabilities for dairy traits in the literature [2]. Content traits were more heritable than yield traits. The most and the less heritable traits were PP and PY, respectively. APD and GEBV correlations between traits showed similar values and were consistent with phenotypic and genetic correlations reported in previous studies on sheep [2]. Strong correlations (from 0.88 to 0.95) were observed between yield traits, and moderate positive correlations were observed between content traits (0.58 and 0.62). MY was negatively correlated with both content traits, while the correlations of the other two yield traits with content traits were low.
Table 1

Ratios between genomic and total variance (diagonal, standard errors of estimates in brackets) and correlations between APD (above the diagonal) and between GEBV (below the diagonal)

TraitsMYFYPYFPPP
MY0.35 (0.02)0.900.95− 0.32− 0.34
FY0.880.33 (0.02)0.920.10− 0.10
PY0.930.910.31 (0.02)− 0.15− 0.03
FP− 0.340.12− 0.140.55 (0.02)0.58
PP− 0.42− 0.14− 0.080.620.61 (0.02)

MY milk yield, FY fat yield, PY protein yield, FP fat content, PP protein content

Ratios between genomic and total variance (diagonal, standard errors of estimates in brackets) and correlations between APD (above the diagonal) and between GEBV (below the diagonal) MY milk yield, FY fat yield, PY protein yield, FP fat content, PP protein content

Reconstruction of gametic phases

Preliminary editing of data led to remove SNPs with more than 5% missing genotypes and with a minor allele frequency (MAF) lower than 0.01. Only SNPs located on the 26 autosomes were retained. The phasing procedure allowed the reconstruction of the sequence of the alleles carried by the investigated BH and RH for more than 99.5% of the SNP positions. After phasing, another 120 SNPs were excluded, because their genotypes were inconsistent with the phase estimated from neighbouring SNPs. Finally, 43,390 SNPs were retained for further analyses. The explored genomic portion was 2437 Mb long and the average distance between SNPs was 56 ± 49 kb with a maximum gap of 2.377 Mb on Ovis aries (OAR) chromosome 21.

IBD probability calculation

On average, the maximum locus-wide probability between each RH and all the BH was 0.99, which indicates the precise reconstruction of the meioses occurred across RP generations. The distribution of the genome-wide number of replicates of the 1207 BH in RP () are depicted in Fig. 1. The impact of BH on RP was extremely variable, in fact the average number of replicates per BH ranged from 1 to 202. The locus-wide probability between BH pairs was zero in 68% of cases. The distribution of non-zero probabilities is shown in Fig. 2. The 13% and 6% of the locus-wide were lower than 0.05 and higher than 0.95, respectively, which suggests that, for a large proportion of BH pairs, it would have been possible to approximate the IBD status to 0 or 1. However, another 13% of locus-wide showed intermediate values, for which the approximation to 0 or 1 would have been less accurate.
Fig. 1

Genome-wide average number of replicates () per base gamete (BH, n = 1207) across the gametes of the resource population (RH; n = 7462)

Fig. 2

Distribution of non-zero probabilities between base gametes (BH; n = 1207) at the locus (blue) and genome-wide (orange) levels

Genome-wide average number of replicates () per base gamete (BH, n = 1207) across the gametes of the resource population (RH; n = 7462) Distribution of non-zero probabilities between base gametes (BH; n = 1207) at the locus (blue) and genome-wide (orange) levels Most of the BH pairs showed a genome-wide probability equal to 0.10, with 95% having values ranging from 0.07 to 0.16 (Fig. 2). This result confirms that the original pool of Sarda gametes as well as the rams used to generate the yearly replacement of RP show a rather large genetic variability.

Principal component analysis and QTL detection

The distribution of the number of PC needed to capture 99% of the locus-wide variability is shown in Fig. 3. The average number of was 32.3 ± 6.4 with a maximum of 74 and a minimum of 9. As far as the breed of origin is concerned, the number of explaining 99% of variation due to (see Additional file 1) and were 24.1 ± 6.0 and 8.2 ± 1.2, respectively.
Fig. 3

Frequencies across all loci (43,390 SNPs) of the number of principal components capturing 99% of the variation at the locus level ()

Frequencies across all loci (43,390 SNPs) of the number of principal components capturing 99% of the variation at the locus level () Concerning the genome-wide analysis, 139 BH with the highest impact on RP () were selected on the basis of and the genome-wide probabilities between BH pairs. The set included all 10 Lacaune gametes and 129 Sarda gametes. The sum of the original number of replicates () of was 0.69 (i.e. 69% of the RH were replicates of ). The remaining 30% of the RH variation was accounted for through the coefficients included in the matrix and derived from probabilities between and other BH. At the genome-wide level, the number of needed to capture 99% of the variation was 137. The distributions of the genome-wide maxima of , corresponding to the null hypothesis :  = 0, obtained by 2000 within-trait permutations, did not show relevant differences across traits (see Additional file 2). The 5% threshold ranged from 5.59 to 5.69. Thus, the most conservative value, 5.69, was retained as the common GW threshold for all the analysed traits. Overall, 2563 positions exceeded the 0.05 GW significance threshold for at least one trait (Fig. 4) (see Additional file 3). There were 200, 108, 122, 918 and 1927 SNP positions significantly associated with MY, FY, PY, FP and PP, respectively. The number of significant positions affecting simultaneously one, two, three and four traits was 1943, 546, 56 and 18, respectively. Several significant positions were adjacent, which may be due to linkage disequilibrium between locations. In order to account for such dependency, significant positions were clustered into QTL regions (QTLR). The correlations between (corresponding to the second term of the model Eq. 1) were calculated for all pairs of significant SNPs on the same chromosome. Then, the strongest signal at the chromosome level was retained as the peak of the first QTLR. The peaks of further QTLR along the chromosome were iteratively identified among the significant locations that had correlations lower than 0.15 with the already defined QTLR peaks. Finally, the remaining significant positions were assigned to the QTLR with which they had the highest correlation. When QTLR for different traits overlapped, we considered them as a unique QTLR. This procedure may underestimate the true number of QTL if more than one gene affecting the trait(s) is located in the same genome region.
Fig. 4

Manhattan plots showing − log10 (nominal p-values) corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) at each locus (43,390 SNPs) are zero. The dashed black lines indicate the 0.05 genome-wide significance threshold determined by permutations. MY milk yield; FY fat yield; PY protein yield; FP fat content; PP protein content

Manhattan plots showing − log10 (nominal p-values) corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) at each locus (43,390 SNPs) are zero. The dashed black lines indicate the 0.05 genome-wide significance threshold determined by permutations. MY milk yield; FY fat yield; PY protein yield; FP fat content; PP protein content Details of the 75 defined QTLR are in Table 2. QTLR were detected across all 26 autosomes except OAR26. The largest number of QTLR (10) was detected on OAR1. Overall, 12, 11, 10, 46 and 43 QTLR significantly affected MY, FY, PY, FP and PP, respectively. Among these 75 QTLR, 46 were significant for one trait only: two for MY, two for FY, two for PY, 23 for FP and 17 for PP; 20 were significant for two traits: one for MY and FY, one for MY and PP and 18 for FP and PP; two QTLR were significant for three traits: one for MY and the two content traits and one for the three yield traits; five QTLR were significant for four traits: one for the three yield traits and FP, three for the three yield traits; and PP, and one for FY, PY, FP and PP; and finally two QTLR significant for all five investigated traits.
Table 2

Details of the QTL regions that include SNP positions exceeding the 0.05 genome-wide significance threshold

QTL regionOARSignificant SNP (n)Highest peakSignificant region (Mb)Max − log10 (p-value)
SNP namePos. (Mb)MYFYPYFCPC
111rs4228621548.208.2–8.26.4
213rs41474590216.2215.6–16.36.0
3187rs39945956938.4631.1–63.79.710.0
412rs422745101100.96100.9–101.06.1
511rs402912954136.40136.4–136.46.05.8
6110rs415285988145.06142.1–145.38.0
714rs424147980168.06168.1–168.17.4
812rs430144352206.80206.4–206.85.9
912rs399774250253.86253.9–255.85.9
101102rs426289520261.02258.1–275.26.612.2
1121rs42074005213.3513.4–13.45.7
1222rs42064720016.6616.7–22.76.5
1325rs40696104424.3124.0–27.27.27.36.55.7
1421rs42004329733.1733.2–33.25.7
1525rs42825193071.4065.4–71.85.96.6
1628rs404690479131.48130.1–140.87.3
172134rs403115176207.86204.9–223.25.89.6
1831rs4007678354.264.3–4.35.8
19315rs42659159524.6424.4–57.56.26.6
20363rs40297916892.5175.4–119.47.16.76.99.07.3
213145rs414469986137.31133.8–144.89.214.9
2237rs400309601179.14177.4–179.46.8
23339rs425759731194.13193.8–209.96.79.5
244131rs42181516712.345.9–24.914.911.8
25452rs42689588755.1330.6–55.38.67.3
2647rs41463347868.1466.6–78.16.16.3
2752rs41952857410.5210.5–10.57.3
28510rs41485372811.4511.4–14.27.1
29527rs40493133486.9172.6–93.17.16.16.96.8
30510rs405537538101.53100.1–106.96.6
3161rs41674351720.8720.9–20.95.7
3261rs40659497922.3922.4–22.45.7
336802rs42382327085.3536.2–105.211.667.0
34712rs43067131173.2545.3–73.36.9
3586rs4112592429.719.7–13.06.9
36824rs40409117279.8052.9–84.67.9
3791rs42393380913.5313.5–13.56.0
3891rs42209333814.8614.9–14.95.9
3992rs42578246331.1130.9–31.16.1
40107rs42716832719.7918.3–20.27.1
411061rs40195518455.9148.6–72.39.57.09.79.1
42101rs41567058784.9484.9–84.96.3
431148rs4289233029.850.7–14.26.59.6
4411104rs42536917955.4343.1–60.311.69.210.78.18.1
45121rs42300687513.9113.9–13.95.9
46122rs40443417821.9822.0–33.86.1
471229rs40482194568.0451.3–68.16.57.8
4813223rs40685606958.5836.5–73.87.410.113.1
49132rs41851710378.1278.1–78.15.9
501423rs40416476226.8211.0–27.36.17.55.87.2
51147rs42572341050.1933.9–50.37.4
52155rs41495482126.1726.1–32.76.5
53153rs42169563361.2856.4–61.36.7
54152rs42104365476.6176.6–76.65.9
55169rs4122716336.045.9–25.36.96.0
561626rs41202573127.9327.9–37.98.26.86.56.6
571680rs42817080957.1038.7–63.77.010.8
581639rs41940031570.1463.7–70.36.98.8
591730rs40860248026.828.1–54.77.37.6
601812rs41845795828.7525.9–34.97.9
611927rs42545773813.156.7–15.37.58.8
62191rs40603178922.7922.8–22.86.1
63193rs39922572923.7223.7–23.86.3
64191rs40295219025.0625.1–25.15.9
652028rs42288077925.704.4–35.38.66.2
66211rs40026475420.1720.2–20.25.7
67218rs41809027729.3229.1–30.67.4
68221rs42732721210.8610.9–10.95.7
69227rs16148089918.0718.0–24.27.7
70226rs39990782133.7531.1–38.97.4
71236rs42793234059.7451.5–59.98.2
72244rs41615328317.588.4–17.67.8
73248rs42257640140.3637.4–40.46.3
742518rs42187223916.5611.1–18.97.17.15.8
75251rs40522583331.4831.5–31.56.0

QTL, region identifier; OAR, Ovis aries autosomes; Significant SNP (n) number of SNP positions exceeding the 0.05 genome-wide significance threshold [− log10(p-value) > 5.69] for at least for one trait; Highest peak, the most significant SNP across traits; SNP name, Pos. (Mb), name and position in Mb (from the ovine genome assembly v4.0 of the most significant SNP); Significant region (Mb), position in Mb of the first and last significant SNP of the QTL region; Max − log10 (p-value), highest significance per trait among the SNPs within a QTL region exceeding the genome-wide threshold of 0.05; − log10 (p-value), negative logarithm of the p-value corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) are zero; MY, milk yield; FY, fat yield; PY, protein yield; FP, fat content; PP, protein content

Details of the QTL regions that include SNP positions exceeding the 0.05 genome-wide significance threshold QTL, region identifier; OAR, Ovis aries autosomes; Significant SNP (n) number of SNP positions exceeding the 0.05 genome-wide significance threshold [− log10(p-value) > 5.69] for at least for one trait; Highest peak, the most significant SNP across traits; SNP name, Pos. (Mb), name and position in Mb (from the ovine genome assembly v4.0 of the most significant SNP); Significant region (Mb), position in Mb of the first and last significant SNP of the QTL region; Max − log10 (p-value), highest significance per trait among the SNPs within a QTL region exceeding the genome-wide threshold of 0.05; − log10 (p-value), negative logarithm of the p-value corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) are zero; MY, milk yield; FY, fat yield; PY, protein yield; FP, fat content; PP, protein content The strongest signal was obtained for PP on OAR6 at 85.34 Mb where a nominal p-value of 11.12*10−67 was observed. The corresponding QTLR harboured significant positions also for FP and MY. The most significant position for FP (p-value = 1.26*10−15) was observed at 12.34 Mb on OAR4. This QTLR affected also PP. The most significant results for the three yield traits (p-value = 2.41*10−12, 5.89*10−10 and 2.20*10−11 for MY, FY and PY, respectively) were detected at 55.43 Mb on OAR11 where a significant peak for FP was also identified (Table 2).

Discussion

Power, precision, robustness of QTL mapping experiments in complex populations may be affected by several issues (size of the experiment, number and frequency of base haplotypes, density of marker maps). As described above, the resource population investigated here is constituted by families based on male ancestors. LDLA mapping approaches are expected to be more suitable than linkage and genome-wide association analyses to fine map QTL regions in such populations. In fact, the LDLA analysis combines both the within-family linkage and population-wide linkage disequilibrium information to estimate IBD probabilities between haplotypes [51]. The proposed approach allowed us to solve the model by the least square method, which avoid a computational expensive variance component analysis. The advantages of approaches based on LDLA regression versus those on variance components, in term of ease of use and computing time, are well known [29] and have been clearly demonstrated by Roldand et al. [52]. In our study, LDLA mapping greatly benefits from the structure of the population, in which the ewes born after the first generation have both parents genotyped, which allows a precise reconstruction of the base gametes of the population and their transmission through generations. IBD information can be efficiently captured with PCA and the computational constraints due to the multi-collinearity generated by the high probabilities that may occur between pairs of BH at the locus level can be overcome. The use of PCA avoided the implementation of prior clustering of BH or approximations in the IBD probability estimation [33, 34]. Moreover the strategy used here to collapse IBD information into principal components (i.e. the use of instead of ) is computationally efficient. It relies on the high precision of LA for defining the ancestral origin of each gamete, which is possible for populations with a strong familial structure. The effectiveness of the method in populations with weaker familial structure should be investigated. Depending on the eigenvalues threshold used for PCs selection, the proposed approach allows to capture most of the IBD variation with a dramatic decrease in the number of effects to estimate. Although the direct solutions of the analysis are the effects of explanatory PC, the effect corresponding to each BH at position can be easily calculated (). Effects and frequencies of BH may be used as basic information to identify the BH that contribute most to the significance of a given locus. In fact, in their study on the identification of putative causative mutations that affect the protein content, Casu et al. [22] used such information to select individuals for whole-genome re-sequencing. The QTL detection model proposed here included a fixed factor to adjust for the polygenic background of each individual based on the effect of PC that summarize the genome-wide BH variation. However, at the genome-wide level probabilities between BH had moderate values when averaged across the genome. Indeed, most PC showed small eigenvalues and many of them were necessary to explain most of the variability. Thus, we applied an approach that aimed at reducing the number of BH to be included in the PCA to those that had the highest impact on RP () by taking their probability to be carried by an individual with a record into account. During the development of our method, we applied it to the dataset that was simulated for the XVI QTLMAS meeting [53]. The results of QTL detection were very close to those reported by Garzia Gamez et al. [53] who used a more classical LDLA method [33, 53], which implemented variance component analysis and accounted for an individual random polygenic effect (see Additional file 4). Overall, a large number of genomic regions that significantly affected milk traits were detected in this study. The number of detected regions largely exceeded those obtained by other LDLA studies in dairy sheep for milk production traits [16]. This larger number of detected QTLR is probably due to the larger size of the analyzed population. Indeed, Garzia-Gamez et al. [16] performed a LDLA mapping on a population of about 1700 Churra ewes that were organized in 16 half-sib families and they detected 34 genome-wide significant regions. Consistent with the estimates of heritabilities, the number of QTLR that affected content traits was larger than that for yield traits. Several positions suggested pleiotropic effects. Twenty-nine QTLR affected more than one trait: nine affected at least two yield traits and frequently one or both of the content traits, four affected MY and both content traits, and 18 were significant for both content traits. A very long list of positional candidate genes was obtained by overlapping the sheep genome reference (Oar_v4.0) with each QTLR. Overall, 745 annotated genes were detected but only a few of these were cited as dairy-related in previous studies on cattle [54] and sheep [55]. Among these, the most interesting genes were those in the casein cluster (CSN1S1, CSN1S2, CSN2 and CSN3), which is mapped to OAR6 within the 85.00–85.23 Mb interval. This interval overlaps with the position of the strongest signal for PP found in this study. At almost the same position, a QTL for PP was detected in Churra sheep by GWA [21]. A deeper investigation of this region is ongoing by whole-genome re-sequencing of individuals that carry BH with large effects at the significant location. The aim is to list the candidate causative mutations by performing specific association studies of all the polymorphisms included in the genomic region [22]. The QTLR that affects PP and FP on OAR3 at 137.3 Mb overlaps with the α-lactalbumin gene (LALBA). This gene was previously reported as a strong candidate for PP and has been deeply investigated in the Churra breed [16, 21]. Two other interesting candidate genes are the growth hormone receptor (GHR) and transcription factor AP-2 gamma (TFAP2C) genes. GHR is located on OAR16 within the 31.83-32.00 Mb interval, where a QTLR that affects yield traits was detected. Previous studies in dairy cattle and sheep have shown that GHR affects milk production [54, 56]. TFAP2C, which is involved in the development, differentiation, and oncogenesis of the mammary gland [55], overlaps with a QTLR that is significant for MY, FP and PP on OAR13 at 58.6 Mb.

Conclusions

We present a simple least square model to map QTL. It combines linkage disequilibrium and linkage analysis information and accounts for the polygenic effects of base gametes. The use of principal component analysis was found to be a good strategy to reduce the computational burden. A large number of regions associated to the variability of milk traits were identified. The outputs provided by this method are useful for the selection of individuals and genes that need to be further investigated for identifying causative mutations or markers in strong linkage disequilibrium with causative variants and for implementing them in genomic selection programs. Additional file 1. Number of principal components (N.) needed to explain more than 99% of the variability due to the Sarda base gametes (BHS) at each locus (43,390 SNPs). OAR (x-axis) Ovis aries autosomes. Additional file 2. Distributions of the genome-wide maxima of −log10(p-values) obtained by 2000 within-trait permutations. −Log(p-values) (x-axis) maximum across the genome (43,390 SNPs) of the negative logarithms of the p-values corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) are zero; MY milk yield; FY fat yield; PY protein yield; FP fat content; PP protein content. Additional file 3. Details on the SNP positions that exceed the genome-wide significance threshold of 0.05 for at least one trait. QTL region identifier; OAR Ovis aries autosomes; SNP name and Position (bp) name and position in base pairs of the significant SNP (from the ovine genome assembly v4.0); n. of significant traits number of traits for which the SNP exceeds the genome-wide threshold of 0.05 [−log10(p-value) > 5.69]; -log(p-value) negative logarithm of the p-value corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) are zero; MY milk yield; FY fat yield; PY protein yield; FP fat content; PP protein content. Additional file 4. Application of the proposed method to the XVI QTLMAS simulated population data and comparison to LDLA results presented by Garcia-Gamez et al. [53]. Manhattan plots showing −log10(nominal p-values) corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to base gametes of the XVI QTLMAS simulated population at each locus (10,000 SNPs) are zero. The dashed black lines indicate the 0.05 genome-wide significance threshold determined by Bonferroni correction for all the tests (10,000). Orange diamonds and grey vertical lines indicate the location of true simulated QTL [53]. Green triangles indicate QTL that were detected by variance component based LDLA mapping [33].
Table 3

Pedigree and phenotypic records for the numerical example

idSireDamSexPhenotype
1M
2M
31F0.180
42F0.796
523F0.972
614F0.631
723F0.823
814F0.796
915F0.545
1026F0.972
1127F0.068
1218F0.807
Table 4

Classification of gametes

Gamete idaClassificationb
1pBH
1mBH
2pBH
2mBH
3pRH
3mcBH and RH
4pRH
4mcBH and RH
5pRH
5mRH
6pRH
6mRH
7pRH
7mRH
8pRH
8mRH
9pRH
9mRH
10pRH
10mRH
11pRH
11mRH
12pRH
12mRH

aGamete ids are defined by the id of the individual with the subscript p or m, which denote the paternal or maternal origin, respectively

bGametes are classified as base gametes (BH) when inherited by an ungenotyped parent or replicates of BH (RH) when inherited by a genotyped parent

cGametes 3m and 4m are classified as BH since the dams of animals 3 and 4 were not genotyped; 3m and 4m will also be treated as RH (i.e. replicates of themselves) since they are associated to phenotypes

Table 5

Grand-parental origin of RH at locus

RHGS gameteGD gameteP(RH = GS)P(RH = GD)
3p1p1m10
3m3m01
4p2p2m01
4m4m01
5p2p2m10
5m3p3m01
6p1p1m01
6m4p4m10
7p2p2m10
7m3p3m10
8p1p1m01
8m4p4m01
9p1p1m10
9m5p5m10
10p2p2m01
10m6p6m01
11p2p2m10
11m7p7m10
12p1p1m01
12m8p8m10

GS Grand-sire, GD Grand-dam, P(RH = GS) probability that RH is a replicate (i.e. identical-by-descent) of a GS gamete, P(RH = GD) probability that RH is a replicate (i.e. identical-by-descent) of a GD gamete

Table 6

Eigenvalues of

Principal componentEigenvalueVariance explained (%)Cumulative variance explained (%)
18.82744.144.1
25.00025.069.1
34.77523.993.0
41.2306.299.2
50.1680.8100.0
60.0000.0100.0
Table 7

Eigenvalues of

Principal componentEigenvalueVariance explained (%)Cumulative variance explained (%)
18.82744.144.1
25.00025.069.1
34.77523.993.0
41.2306.299.2
50.1680.8100.0
60.0000.0100.0
70.0000.0100.0
80.0000.0100.0
90.0000.0100.0
100.0000.0100.0
110.0000.0100.0
120.0000.0100.0
130.0000.0100.0
140.0000.0100.0
150.0000.0100.0
160.0000.0100.0
170.0000.0100.0
180.0000.0100.0
190.0000.0100.0
200.0000.0100.0
Table 8

Eigenvalues of

Principal componentEigenvalueVariance explained (%)Cumulative variance explained (%)
15.05930.330.3
24.20125.155.4
33.75022.477.8
43.71322.2100.0
  42 in total

1.  Fine mapping of quantitative trait loci using linkage disequilibria with closely linked marker loci.

Authors:  T H Meuwissen; M E Goddard
Journal:  Genetics       Date:  2000-05       Impact factor: 4.562

2.  Molecular dissection of a quantitative trait locus: a phenylalanine-to-tyrosine substitution in the transmembrane domain of the bovine growth hormone receptor is associated with a major effect on milk yield and composition.

Authors:  Sarah Blott; Jong-Joo Kim; Sirja Moisio; Anne Schmidt-Küntzel; Anne Cornet; Paulette Berzi; Nadine Cambisano; Christine Ford; Bernard Grisart; Dave Johnson; Latifa Karim; Patricia Simon; Russell Snell; Richard Spelman; Jerry Wong; Johanna Vilkki; Michel Georges; Frédéric Farnir; Wouter Coppieters
Journal:  Genetics       Date:  2003-01       Impact factor: 4.562

Review 3.  Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons.

Authors:  J C M Dekkers
Journal:  J Anim Sci       Date:  2004       Impact factor: 3.159

4.  Multipoint identity-by-descent prediction using dense markers to map quantitative trait loci and estimate effective population size.

Authors:  Theo H E Meuwissen; Mike E Goddard
Journal:  Genetics       Date:  2007-06-11       Impact factor: 4.562

5.  Fine mapping of quantitative trait loci affecting female fertility in dairy cattle on BTA03 using a dense single-nucleotide polymorphism map.

Authors:  Tom Druet; Sébastien Fritz; Mekki Boussaha; Slim Ben-Jemaa; François Guillaume; David Derbala; Diana Zelenika; Doris Lechner; Céline Charon; Didier Boichard; Ivo G Gut; André Eggen; Mathieu Gautier
Journal:  Genetics       Date:  2008-04       Impact factor: 4.562

6.  Detection of quantitative trait loci and putative causal variants affecting somatic cell score in dairy sheep by using a 50K SNP chip and whole-genome sequencing.

Authors:  B Gutiérrez-Gil; C Esteban-Blanco; A Suarez-Vega; J J Arranz
Journal:  J Dairy Sci       Date:  2018-08-09       Impact factor: 4.034

7.  Mapping quantitative trait loci for milk production and genetic polymorphisms of milk proteins in dairy sheep.

Authors:  Francis Barillet; Juan-José Arranz; Antonello Carta
Journal:  Genet Sel Evol       Date:  2005       Impact factor: 4.297

8.  The genomic architecture of mastitis resistance in dairy sheep.

Authors:  G Banos; G Bramis; S J Bush; E L Clark; M E B McCulloch; J Smith; G Schulze; G Arsenos; D A Hume; A Psifidi
Journal:  BMC Genomics       Date:  2017-08-16       Impact factor: 3.969

9.  Mapping quantitative trait loci (QTL) in sheep. II. Meta-assembly and identification of novel QTL for milk production traits in sheep.

Authors:  Herman W Raadsma; Elisabeth Jonas; David McGill; Matthew Hobbs; Mary K Lam; Peter C Thomson
Journal:  Genet Sel Evol       Date:  2009-10-22       Impact factor: 4.297

10.  GWA analysis for milk production traits in dairy sheep and genetic support for a QTN influencing milk protein percentage in the LALBA gene.

Authors:  Elsa García-Gámez; Beatriz Gutiérrez-Gil; Goutam Sahana; Juan-Pablo Sánchez; Yolanda Bayón; Juan-José Arranz
Journal:  PLoS One       Date:  2012-10-18       Impact factor: 3.240

View more
  2 in total

1.  Assessing the Diversity and Population Substructure of Sarda Breed Bucks by Using Mtdna and Y-Chromosome Markers.

Authors:  Maria Luisa Dettori; Elena Petretto; Michele Pazzola; Oriol Vidal; Marcel Amills; Giuseppe Massimo Vacca
Journal:  Animals (Basel)       Date:  2020-11-24       Impact factor: 2.752

2.  Association analysis and functional annotation of imputed sequence data within genomic regions influencing resistance to gastro-intestinal parasites detected by an LDLA approach in a nucleus flock of Sarda dairy sheep.

Authors:  Sara Casu; Mario Graziano Usai; Tiziana Sechi; Sotero L Salaris; Sabrina Miari; Giuliana Mulas; Claudia Tamponi; Antonio Varcasia; Antonio Scala; Antonello Carta
Journal:  Genet Sel Evol       Date:  2022-01-03       Impact factor: 4.297

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.