Literature DB >> 29751743

Dissecting closely linked association signals in combination with the mammalian phenotype database can identify candidate genes in dairy cattle.

Zexi Cai1, Bernt Guldbrandtsen2, Mogens Sandø Lund2, Goutam Sahana2.   

Abstract

BACKGROUND: Genome-wide association studies (GWAS) have been successfully implemented in cattle research and breeding. However, moving from the associations to identifying the causal variants and revealing underlying mechanisms have proven complicated. In dairy cattle populations, we face a challenge due to long-range linkage disequilibrium (LD) arising from close familial relationships in the studied individuals. Long range LD makes it difficult to distinguish if one or multiple quantitative trait loci (QTL) are segregating in a genomic region showing association with a phenotype. We had two objectives in this study: 1) to distinguish between multiple QTL segregating in a genomic region, and 2) use of external information to prioritize candidate genes for a QTL along with the candidate variant.
RESULTS: We observed fixing the lead SNP as a covariate can help to distinguish additional close association signal(s). Thereafter, using the mammalian phenotype database, we successfully found candidate genes, in concordance with previous studies, demonstrating the power of this strategy. Secondly, we used variant annotation information to search for causative variants in our candidate genes. The variant information successfully identified known causal mutations and showed the potential to pinpoint the causative mutation(s) which are located in coding regions.
CONCLUSIONS: Our approach can distinguish multiple QTL segregating on the same chromosome in a single analysis without manual input. Moreover, utilizing information from the mammalian phenotype database and variant effect predictor as post-GWAS analysis could benefit in candidate genes and causative mutations finding in cattle. Our study not only identified additional candidate genes for milk traits, but also can serve as a routine method for GWAS in dairy cattle.

Entities:  

Keywords:  Candidate genes; Closely linked association signals; Dairy cattle; GWAS; Milk traits

Year:  2018        PMID: 29751743      PMCID: PMC5948690          DOI: 10.1186/s12863-018-0620-0

Source DB:  PubMed          Journal:  BMC Genet        ISSN: 1471-2156            Impact factor:   2.797


Background

Over the last decade, the development of high density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) technologies have made genome wide marker sets available in many organisms [1, 2]. Combining these with phenotypic records on many individuals, genome wide associate studies (GWAS) present a powerful tool to undercover genetic variants associated with variation in the phenotype [3]. In human, numerous studies successfully identified causal variants for traits such as height [4], bodyweight [5] as well as several complex diseases [6]. However, in livestock, long range linkage disequilibrium typically results in imprecise determination of quantitative trait loci (QTL) locations and the associated genomic regions containing several positional candidate genes. In addition, two or more QTL located close to each other may be misidentified as one QTL. In such situations, additional analyses need to be performed to distinguish multiple QTL located close to each other. To resolve these issues, we need additional information over and above association statistics. For traits with Mendelian inheritance, techniques such as homozygosity mapping and studies of recombinant haplotypes provide important clues due to the unambiguous association of at least some genotypes with phenotypic differences. For quantitative traits, no such close associations exist. However, genomic information of various types do allow relative prioritization among candidate variants. The challenges are which information to consider post-GWAS and how to combine them with GWAS statistics. Expression quantitative trait loci (eQTL) mapping can help; expression profiles as the dependent trait in a GWAS have identified causal genes in some studies [7]. Nevertheless, eQTL studies are time consuming and expensive. Therefore, alternative approaches to incorporate gene expression data into GWAS are needed. Other sources of additional information like variants’ annotation [8] and evolutionary conservation scores [9] have been used. Unfortunately, these analyses need to be designed on a case-by-case basis [10]. Their implementation is challenging in livestock due to the sparsity of annotation data. In this study, we used an approach to separate multiple closely linked QTL in dairy cattle by fixing the lead SNP as a covariate. The analysis detects QTL chromosome by chromosome, and generates a list of lead SNPs for each QTL. The method is demonstrated by application to three milk yield traits in Nordic Holstein cattle. Many previously identified loci were also confirmed here. Furthermore, we used the mammalian phenotype database to help find the candidate genes and Variant Effect Predictor (VEP) annotations to screen for possible causal mutations.

Results

We applied a GWAS analysis approach that automatically and iteratively accounts for the effects of QTL identified in previous iteration(s), a similar approach to conditional analysis implemented in GCTA [11]. The impact of pre-correction on type I error rate was assessed by analyzing simulated data with the effect of a quantitative trait nucleotide (QTN) added to the real phenotypic data (for details on simulation, see Method section). The candidate genes were picked as the closest genes to the lead SNP and listed in Tables 1, 2 and 3. The search for candidate genes started with the top SNP location. However, the whole genomic region showing strong associations with the trait was searched, as the top SNP may not be always located closest to the causal gene due to differences in: LD, imputation accuracy and minor allele frequency. Therefore, we included discussion on other relevant genes (based on association results, known gene function etc.) which could be candidate genes underlying the QTL.
Table 1

Lead SNPs from genome-wide associated regions for fat yield in Nordic Holstein cattle. Base positions are given as position in UMD 3.1.1 [49]

BTAbase positionImputation accuracyEffect–log10(p)RegionGeneAnnotation
21269798820.9972−1.3111.46126041707~ 127230070PIGV (near)Downstream
285991577b0.95421.308.9185042155~ 86241732 ANKRD44 Intron
372263900.9998−1.099.016264604~ 7476473 NOS1AP Intron
5939483570.99063.2862.4193698481~ 94198475 MGST1 Intron
520284735b0.9692−1.309.7920035379~ 205347795S_rRNAintergenic
6954979330.9996−1.4514.7695248213~ 95747954 PAQR3 intergenic
632950721b0.49756.3311.3932367171~ 33200834ENSBTAG00000047255Intron
7572879900.8807−1.6620.1157038213~ 57538309 KCTD16 Intron
9387151370.9809−1.478.8938345408~ 38965425 LAMA4 Intron
11887714490.98761.1610.4388521462~ 89021477ENSBTAG00000047976intergenic
1115323223b0.8962−1.329.8114855568~ 15573444 TTC27 Intron
1155681193c0.9948−1.609.9155423855~ 55931229 CTNNA2 Intron
12689657580.9957−1.108.9368502223~ 69216445ENSBTAG00000045195intergenic
14a18022650.9398−6.93240.561549133~ 2049435 DGAT1 missense
14a18022660.9362−6.93240.561549133~ 2049435 DGAT1 missense
15658911000.99921.5012.9965363764~ 66141839 ELF5 intergenic
1525044706b0.9908−1.179.8024795472~ 25295470 ZBTB16 Intron
17625431600.98981.1410.4962224291~ 62793298 TBX5 Intron
18189705510.9442−1.1910.3018341203~ 19220732 NKD1 intergenic
19275229270.8500−1.3210.8626625240~ 27773016 ASGR1 intergenic
20226097360.98131.5314.2321664412~ 22859809 MAP3K1 intergenic
26205474450.9993−1.7621.4620297497~ 20797570 COX15 Intron
2642408595b0.9998−1.2110.3041409014~ 42658925 TACC2 Intron
29236094120.77172.0610.7322613737~ 23859451ENSBTAG00000047094intergenic
Total number of significant SNPs54,435

a Fourteen additional SNPs on chromosome 14 located near DGAT1 gene had same highest P value (details on those not presented). b indicated this SNP was found on second round, c indicated this SNP was found on third round

Table 2

Lead SNPs from genome-wide associated regions for protein yield in Nordic Holstein cattle. Base positions are given as position in UMD 3.1.1 [49]

BTAbase positionImputation accuracyEffect–log10(p)RegiongeneAnnotation
1631779470.9885−1.9412.3561838881~ 63178271ENSBTAG00000046854 (near)intergenic
21248376690.98861.5912.63124834921~ 124837676 PTPRU Intron
3171605210.9717−1.158.7615377852~ 17160986S100A12 (near)Upstream
5935118260.8626−1.3714.2593087740~ 93511841LMO3 (near)intergenic
521792183a0.9813−1.3710.3920072875~ 21792190SNORD107 (near)intergenic
587923795b0.99261.508.9785996611~ 87924188ETNK1 (near)intergenic
6884775010.9962−2.6025.9887154594~ 88477524 SLC4A4 Intron
648477272a0.73291.4913.5946907022~ 48477298ENSBTAG00000045570 (near)intergenic
688749792b0.3222−1.6612.1386831016~ 88749854GC (near)intergenic
7413729890.9999−1.5418.1441085164~ 41373119MGAT1 (near)intergenic
772100619a0.90771.5913.2970118741~ 72100721EBF1 (near)intergenic
8930657870.85731.6510.0791065857~ 93066321 GRIN3A Intron
831538155a1.00001.919.6230388755~ 31538625LURAP1L (near)intergenic
9332678550.8655−1.4611.9632627954~ 33267856SLC35F1 (near)intergenic
10939333040.8370−1.369.9092043775~ 93933391 SEL1L Intron
11355127080.9999−1.4511.8233534073~ 35512724ENSBTAG00000027786 (near)intergenic
13372087920.9279−1.6910.9035572625~ 37208843MKX (near)intergenic
1377657858a0.99061.179.5276677111~ 77908632 PREX1 intron
1418354400.74712.8448.661448510~ 1836166 BOP1 Intron
1467981742a0.76521.7811.6065984180~ 67981772STK3 (near)intergenic
16322629830.9290−1.5212.7930519873~ 32263130 SMYD3 Intron
18570154070.97542.5617.7155015676~ 57015473 POLD1 Intron
1815272231a0.6697−1.169.5315032157~ 15272234SNORA11(near)downstream
19275229270.8500−1.4212.5526422519~ 27522980RNASEK (near)downstream
1961014793a0.8505−1.088.6560995058~ 61014874KCNJ2 (near)intergenic
20690066090.9920−1.2911.2768120719~ 69006661IRX1 (near)intergenic
209282667a0.67471.7110.937765154~ 9282923 MRPS27 intron
23109749680.9304−1.1810.689127211~ 10975139FGD2 (near)intergenic
25364037191.00001.3310.2536112575~ 36403849EPO (near)intergenic
26376954940.9122−1.4114.7636684176~ 37695588KIAA1598 (near)intergenic
27363049780.98341.068.5235875452~ 36305040 ANK1 intron
29176206170.95761.4710.3715650574~ 17620644 NARS2 intron
2935459126a0.99991.6110.1133464929~ 35459620 NTM intron
Total number of significant SNPs38,439

aindicated this SNP was found on second round, b indicated this SNP was found on third round

Table 3

Lead SNP from genome-wide associated regions for milk yield in Nordic Holstein cattle. Base positions are given as position in UMD 3.1.1 [49]

BTAbase positionImputation accuracyEffect–log10(p)RegionGeneAnnotation
2807538950.94541.139.9579587952~ 80754103NABP1 (near)intergenic
3564029590.9308−1.3611.6856392727~ 56402961ENSBTAG00000001873 (near)intergenic
41015476440.7008−1.6612.65100921921~ 101547648CHRM2 (near)upstream
5939534870.9726−2.1029.5291953587~ 93953619MGST1 (near)upstream
523022794a0.76171.3812.9321101742~ 23022797EEA1 (near)intergenic
585080296b0.7619−1.2811.2484425435~ 85080331ENSBTAG00000009778 (near)intergenic
6888475950.9009−1.7821.6188612186~ 88847596GC (near)intergenic
646901490a0.7413−1.2811.4546181675~ 46901554SEL1L3 (near)intergenic
638027010b0.9950−4.759.4736909885~ 38027583 ABCG2 missense
7653708500.9848−1.3613.5865256765~ 65370922GLRA1 (near)intergenic
8738778140.8453−1.3711.1471877875~ 73877845ENSBTAG00000010829 (near)upstream
842062591a0.9595−1.2710.0740245362~ 42062776 KCNV2 intergenic
9334785270.8801−1.259.2331790030~ 33478670NUS1 (near)intergenic
1019899070.9469−1.159.92448434~ 1990092ENSBTAG00000047622 (near)intergenic
13368223300.9933−1.6610.7436663680~ 36822395 MPP7 Intron
14#18026670.79755.98178.351702853~ 1797137 DGAT1 Intron
1467577503a0.8898−2.1611.0466624772~  67828111 OSR2 intergenic
15543926110.95771.5716.5852771707~ 54393036 PPME1 Intron
16283842600.99841.6410.5028012864~ 28384605CNIH3 (near)Intergenic
17665102240.94381.8311.6366119023~ 66510712 CORO1C Intron
18465833460.98291.8611.9744583383~ 46583963UPK1A (near)upstream
19274424520.7904−1.269.7126592355~ 27442492bta-mir-497 (near)upstream
20299967190.9580−2.9531.0227997007~ 29996870MRPS30 (near)intergenic
23250764720.9797−1.349.2323690289~ 25076491 GCM1 Intron
26377164200.9790−1.4312.2836730021~  37966463 TRPA1 intergenic
28349723770.9991−1.299.8133464705~ 34972672 ZMIZ1 intergenic
Total number of significant SNPs57,808

#Eight additional SNPs on chromosome 14 had same highest P value. a indicated this SNP was found on second round, b indicated this SNP was found on third round

Lead SNPs from genome-wide associated regions for fat yield in Nordic Holstein cattle. Base positions are given as position in UMD 3.1.1 [49] a Fourteen additional SNPs on chromosome 14 located near DGAT1 gene had same highest P value (details on those not presented). b indicated this SNP was found on second round, c indicated this SNP was found on third round Lead SNPs from genome-wide associated regions for protein yield in Nordic Holstein cattle. Base positions are given as position in UMD 3.1.1 [49] aindicated this SNP was found on second round, b indicated this SNP was found on third round Lead SNP from genome-wide associated regions for milk yield in Nordic Holstein cattle. Base positions are given as position in UMD 3.1.1 [49] #Eight additional SNPs on chromosome 14 had same highest P value. a indicated this SNP was found on second round, b indicated this SNP was found on third round Our approach of including associated SNPs as covariates in subsequent rounds of analyses did not increase the type I error rates. We simulated one SNP as a QTN and considered ten other SNPs with different levels of LD (r2) with the QTN in order to test whether our method introduces type I error into analysis when fixing lead SNPs detected in previous iterations as covariates [12]. We generated new phenotypes from the real phenotypic value plus the simulated QTN effects. The QTN’s contribution to individuals’ phenotypes was obtained by multiplying the genotype dosage of the QTN with the allele substitution effect which was drawn from a normal distribution with a mean 20% of the standard deviation (SD) of the phenotype and variance as 1% of the phenotypic variance. The simulation was repelicated 100 times. We detected the simulated QTN as the lead SNP in the first round of all 100 replicates. When the simulated QTN was included in the model as a covariate, we did not observed any of the 10 SNPs in LD with QTN to be significant (i.e., no false positives detected).

The GWAS of fat yield

Analyzing milk fat yield, our approach detected seven additional QTL over and above the QTL detected in the first round (Fig. 1 and Table 1). In Table 1, the first SNP on each chromosome is the lead SNP from the first round of GWAS analysis, the rest are the additional SNP(s) detected on a chromosome. Sixteen SNPs on chromosome 14 have the same P-value in the first round, and these SNPs are in high LD with the two known causative polymorphisms in DGAT1 [13], BTA14: 1802265 (rs109234250) and BTA14: 1802266 (rs109326954) (Additional file 1: Figure S1). The variant effect predictor (VEP) [14] annotation showed these two variants in DGAT1 are missense mutations. The second strongest association signal was located on chromosome 5 with lead SNP, BTA5: 93948357 (rs209372883) located within the intron of MGST1. MGST1 was previously reported associated with the milk fat content [15]. On chromosome 26, our lead SNP pointed to COX15. In a human study, this gene was proposed involved in biosynthesis of heme A [16]. Even though this gene is a promising positional candidate gene, no biological information currently links this gene to milk fat yield. Another gene known to affect milk fat content is SCD1 [17] located at chromosome 26: 21141592 ~ 21,148,318. Our lead SNP on chromosome 26 (BTA26:20547445, rs136702635) is located close to it. We estimated the variance explained by QTL. The QTL (16 QTL) found from the first round explained 22.77% of the variance of de-regressed proof breeding value (DRP) for fat yield and all QTL (23 QTL) explained 25.12% of the DRP variance (Table 4).
Fig. 1

Manhattan plot for association of SNP with fat yield in Nordic Holstein cattle. Red horizontal line indicates genome-wide significance level [−log10(P) = 8.5]

Table 4

The genetics variants explained by QTL and the rest of SNPs

Number of QTLV(G1)/Vpb (%)bV(G2)/Vpc (%)c
Fat1a1622.7762.59
Fat2a2325.1260.01
Prot1a2110.8574.05
Prot2a3315.3468.89
Milk1a2018.8566.67
Milk2a2621.2963.97

a Fat means the trait of fat yield, Prot means the trait of protein yield, Milk means the trait of milk yield; 1 indicate the lead SNP list only included the lead SNP from the first round, 2 indicated the lead SNP list included all lead SNP found by our approach.b means the percentage of genetics variants explained by the QTL, c means the percentage of genetics variants explained by the rest of SNP other than QTL

Manhattan plot for association of SNP with fat yield in Nordic Holstein cattle. Red horizontal line indicates genome-wide significance level [−log10(P) = 8.5] The genetics variants explained by QTL and the rest of SNPs a Fat means the trait of fat yield, Prot means the trait of protein yield, Milk means the trait of milk yield; 1 indicate the lead SNP list only included the lead SNP from the first round, 2 indicated the lead SNP list included all lead SNP found by our approach.b means the percentage of genetics variants explained by the QTL, c means the percentage of genetics variants explained by the rest of SNP other than QTL

The GWAS for protein yield

We ran the analysis on the milk protein yield (Fig. 2), and found 33 lead SNPs (Table 2), 12 of which were detected in the second or third round. The strongest association signal for protein yield was on BTA14 with lead SNP BTA14:1835440 (rs208567981), located within BOP1. The annotation of BTA14:1835440 (rs208567981) is a missense mutation, and the SIFT annotation is tolerant. However, this signal is most likely due to the known mutation in DGAT1. The lead SNP (rs208567981) was in strong LD with SNPs located within DGAT1 and the –log10(P) value of these 19 SNPs within DGAT1 were larger than 47.99 (including two known causative variants in DGAT1, Additional file 1: Figure S2). This result shows that the causal mutation may not necessarily be the SNP in highest association. The second lead SNP of this analysis is BTA6: 88477501, which is located near the well-studied casein genes CSN1S1, CSN1S2, CSN3 and CSN2 [18]. We estimated the variance explained by QTL. The QTL (21 QTL) found only from the first round explained 10.85% of the DRP variance for protein yield and all QTL (33 QTL) explained 15.34% of the DRP variance (Table 4).
Fig. 2

Manhattan plot for association of SNP with protein yield in Nordic Holstein cattle. Red horizontal line indicates genome-wide significance level [−log10(P) = 8.5]

Manhattan plot for association of SNP with protein yield in Nordic Holstein cattle. Red horizontal line indicates genome-wide significance level [−log10(P) = 8.5]

The GWAS for milk yield

We applied our analysis to milk yield (Fig. 3). A total of 26 lead SNPs (Table 3) were detected, out of which six were detected in the second or third round. The most significant association signal was in the DGAT1 gene. The second most significant association signal was at BTA20:29996719 (rs43116343), which is close to MRPS30. A recent study showed MRPS30 to be associated with lactation persistence in Canadian Holstein cattle [19]. This lead SNP is also close to the growth hormone receptor, GHR [20]. The causative mutation of GHR is BTA20:31909478 (rs385640152), and is known to affect milk yield [20]. The third strongest lead SNP was BTA5:93953487 (rs210234664). This SNP is close to MGST1. A previous eQTL study showed MGST1 may affect milk composition [21]. With our approach, we detected BTA6: 38027010 (rs43702337) in the third round, located in ABCG2. ABCG2 was previously reported to affect milk yield in dairy cattle [22]. This SNP is a missense variant; its SIFT annotation is “deleterious” and has previously been proposed as a causative mutation [23]. We estimated the variance explained by QTL. The QTL (20 QTL) found from the first round explained 18.85% of the DRP variance for milk yield and all QTL (26 QTL) explained 21.29% of the phenotypic variance (Table 4).
Fig. 3

Manhattan plot for association of SNP with milk yield in Nordic Holstein cattle. Red horizontal line indicates genome-wide significance level [−log10(P) = 8.5]

Manhattan plot for association of SNP with milk yield in Nordic Holstein cattle. Red horizontal line indicates genome-wide significance level [−log10(P) = 8.5]

Post-GWAS analysis using the mammalian phenotype database

The criteria for selecting positional candidate genes was the gene located closest to the lead SNP. For future identification and research on genes biologically associated with milk traits, we tried to find whether there are other genes which should be considered as potential candidate genes other than the candidate gene list (Tables 1-3). Considering the high LD structure of cattle population, the causal genes may be located within the genome region in LD with lead SNPs. One source of additional information that may help to prioritize genes, is to find the link between the gene and the possible function in the mammalian phenotype database related to milk and milk-organ related traits [24]. Therefore, we extracted genes which overlap with the LD region of the lead SNP and search them in the mammalian phenotype database [24]. We only paid attention to two kinds of phenotypes: “abnormal mammary gland development” or “abnormal milk composition”. Eight genes from the GWAS hits were also annotated as related to these two types of phenotype. This annotation appears to have biological relevance, although the enrichment of these 8 genes in the mammalian phenotype database analyzed by Fishers’ exact test was not significant. The results showed four genes were reported to be related with “abnormal milk composition” (Table 5). Out of this list, CSN1S1, CSN2, CSN3 and DGAT1 were reported in dairy cattle and also identified in the present study. Furthermore, we identified five genes related to “abnormal in mammary gland development” (Table 6) in mammalian phenotype database. In this list DGAT1 showed abnormal phenotype in both kinds of phenotype description we searched. In addition to the well-studied genes (CSN1S1, CSN2, CSN3 and DGAT1), the remaining four genes are ELF5, CAT, STK3 and CHUK. ELF5 is one of the candidate genes proposed by the closest genes to lead SNP (BTA15: 65891100) associated with the fat yield (Table 1). ELF5 was previously found related to mouse mammary development [25] and may also influence the milk content through milk protein synthesis in cattle [26]. CAT is also located close to the same lead SNP as ELF5. CAT is involved in several biological processes including GO term ‘responds to fatty acid’ [27]. CHUK, close to BTA26: 20547445, is associated with fat yield (Table 1). This gene is known as a key gene involved in mammary development in mice [28]. STK3 is the nearest gene to the second lead SNP (BTA14: 67981742) on the same chromosome associated with milk protein yield (Table 2). This gene was found to play a pivot role in controlling cell proliferation [29] and tumor suppression [30] in human studies.
Table 5

Genes related to “abnormal milk composition” phenotype in the mammalian phenotype database [24] overlapped with milk QTL identified in the present study

Gene nameLocationPhenotype
CSN1S1 BTA6: 87,141,556–87,159,096abnormal milk composition
CSN2 BTA6: 87,179,502–87,188,025abnormal milk composition
CSN3 BTA6: 87,378,398–87,392,750abnormal milk composition
DGAT1 BTA14: 1,795,351–1,804,562abnormal milk composition
Table 6

Genes related to “abnormal of mammary gland development” in the mammalian phenotype database [24] overlapped with milk QTL identified in the present study

Gene nameLocationPhenotype
CAT BTA15: 65,779,325–65,815,261decreased mammary gland tumor incidence
ELF5 BTA15: 65,824,442–65,854,386abnormal mammary gland development
STK3 BTA14: 67,677,676–67,987,801increased mammary gland tumor incidence
DGAT1 BTA14: 1,795,351–1,804,562abnormal mammary gland development
CHUK BTA26: 20,966,010–21,008,277abnormal mammary gland growth during pregnancy
Genes related to “abnormal milk composition” phenotype in the mammalian phenotype database [24] overlapped with milk QTL identified in the present study Genes related to “abnormal of mammary gland development” in the mammalian phenotype database [24] overlapped with milk QTL identified in the present study

Annotation of SNPs in LD with lead SNPs

As shown before, the causative mutation maybe located in the neighboring region of the lead SNP. Therefore, we extracted all SNPs in LD with leading SNPs (r2 > 0.2) and annotated them using VEP [14]. We extracted 190,130 SNPs and obtained 203,120 annotations (because some genes or transcripts overlap). The majority of these SNPs are intergenic variants or intron variants (Fig. 4a). Among the SNPs that changed the coding sequence of the protein, most of them were synonymous variants (Fig. 4b). Using this result, we checked if we could prioritize candidate mutations in the candidate genes. For example GHR, the well-known causative mutation for GHR is BTA20:31909478 (rs385640152, F279Y) [20]. The annotation for this SNP is a missense mutation and the SIFT score is 0.02 which is ‘deleterious’.
Fig. 4

The VEP annotation of SNPs in linkage disequilibrium (LD > 0.20) with leading SNPs. a The summary of all annotation. b The summary of annotation that change the protein coding sequence

The VEP annotation of SNPs in linkage disequilibrium (LD > 0.20) with leading SNPs. a The summary of all annotation. b The summary of annotation that change the protein coding sequence Further, we checked whether we can detect some candidate mutations in the new candidate genes. Four genes (CSN1S1, CSN2, CSN3 and DGAT1) were found related to abnormal milk composition and DGAT1 related to mammary gland development (Table 5 and Table 6) as reported previously. In addition to DGAT1, we found several tolerance missense mutations in CSN1S1 and CSN2. In CSN3, we found three tolerance missense mutations and one deleterious mutation (BTA6:87390632, rs43703017). Moreover, in STK3, we also found tolerance missense mutations.

Discussion

Although functional gene clustering is weaker in eukaryotes genomes than in prokaryotes genomes, functional grouping of the genes with same or similar function still exists [31]. Therefore, in GWAS analysis, we may fail to detect the nearby genes and may treat them as one significant signal. In our study, we used an analysis approach to detect multiple nearby QTL by iteratively fixing the lead SNP as covariate. However, such an approach can inflate type I error rate [12]. To avoid introducing additional type I errors, we placed a condition that the lead SNPs detected in the second and subsequent rounds must be found to be genome-wide significant in the first round (i.e., significant according to conventional GWAS criteria). In addition, we tested our approach on simulated data with a simulated QTN and multiple SNPs with various levels of LD with the QTN. In 100 replicates, we found no additional SNP in LD with the QTN other than the simulated causative variants. By using this analysis, we were able to detect multiple QTL (as well as designating the lead SNP for each QTL) on a chromosome automatically. For example, we detected a known QTL on BTA6 (BTA6:38027010, rs43702337) in the third round and also another QTL at 46 Mb (in the second round). This SNP is located in the gene ABCG2 which was previously reported to affect milk yield in dairy cattle [22] and this lead SNP was the most probable causative mutation [23]. Furthermore, our approach also showed the potential to distinguish closely linked QTL. For example the lead SNPs on chromosome 6 of protein content, we detect the first association signal at BTA6: 88477501 and the third association signal at BTA6: 88749792. Similar conditional analyses were also applied in human and other livestock studies [32-34]. Here, we analyzed one lead SNP at a time, as opposed to Bolormaa et al. [34] who included all lead SNPs simultaneously in the model. We also compared the genetic variants explain by the QTL found by first round and all the QTL found by our approach. The results showed the QTL found at second and third round did explain more phenotype variants (Table 4). Post GWAS, we face the challenge of identifying the candidate genes. The conventional method is to use the nearest gene, but this may miss the target as many-a-time the lead SNP may not be from the causal gene. This could be due to imputation inaccuracies, multiple QTL in the vicinity or random chance factor. Therefore, we need to use additional information to prioritize the candidate genes. In this study, we used the mammalian phenotype database to search for candidate genes from the genes located in association regions. The mammalian phenotype is based on mouse mutation lines. As a test, we extracted all genes located within LD of the lead SNPs for all three milk yield traits and searched for related phenotype terms. Here, we searched for two phenotype terms ‘abnormal mammary gland development’ and ‘abnormal milk composition’. We successfully identified some well-known genes affecting milk related traits in cattle as well as new candidate genes (Table 5 and Table 6). For the term ‘abnormal milk composition’, we identified four genes. All of which were reported previously in different studies [35, 36], and only DGAT1 is the nearest gene to the lead SNP on chromosome 14. Another term we searched is ‘abnormal in mammary gland development’ and found five genes. CAT and CHUK are not the nearest genes to the lead SNP. However, differences between mice and cattle may introduce some false positives. In all, using this strategy we not only found some well-studied genes missing from the nearest genes method (pick the gene which is nearest to lead SNP as candidate genes), but also identified new candidate genes which may be helpful in finding causal factors. We also face another challenge of identifying the causative variant once the causal gene is identified as levels of linkage disequilibrium in cattle are high [37]. In many cases the causative variant is not the lead SNP [38] but another SNP hidden within the LD of the lead SNP. In human studies, there are different strategies to prioritize variants [10]. In this study, information from Ensembl [14] was used to prioritize potential causative variants. In our case, the DGAT1 and ABCG2 can be detected in our lead SNP list, and the causative mutation of both can be detected in VEP annotation as missense variants. GHR was found nearby the location of lead SNPs. For ABCG2 and GHR, the SIFT score show these mutations as ‘deleterious’. For DGAT1, even though the SIFT showed these two mutations are tolerated the amino acid of the protein is changed. Therefore, the impact of moderate and high reported by VEP can be considered as possible causative mutations, while SIFT score can be used to provide additional support. In summary, our analysis approach can distinguish nearby association signals of multiple QTL. In our study, we found candidate genes reported by previous studies. Followed by searching genes within the LD region of the lead SNPs, we can find high confidence candidate genes. Lastly, using VEP can help us to find putative causative mutations within candidate genes and provides a good source for further functional validation. However, our approach will not be able to pinpoint causal variants located in the non-coding and regulatory regions due to lack of annotation of the cattle genome.

Conclusions

In this study, we designed an approach for detecting closely linked multiple association signals and performed the analysis in Nordic Holstein cattle for milk, fat and protein yields. The results showed we not only detected most of the well-known genes affecting these three milk yield traits but also detected additional candidate genes. Post-GWAS, we used information from the mammalian phenotype database and variant effect predictor to confirm known genes and causative mutations. In the meanwhile, we detected additional genes which might be contributing to variation in milk traits in Nordic Holstein cattle. Therefore, we concluded our approach can be used routinely for GWAS studies in dairy cattle.

Methods

Phenotype and genotype data

No animal experiments were performed in this study, and therefore, approval from the ethics committee was not required. Phenotypic records for Nordic Holstein cattle are kept in a centralized database (Nordic Cattle Genetic Evaluation, NAV. http://www.nordicebv.info/). Breeding values for milk, fat and protein yield (MY, FY and PY) are based on production figures expressed in kilograms taken from routine milk records and then combined into an index for each trait. For details on genetic evaluation for milk yield traits in Nordic countries see (http://www.nordicebv.info/production). The breeding values used for association analysis were de-regressed proof breeding values [39, 40] from the routine genetic evaluation by NAV and were available for 5382 progeny tested Holstein bulls. The association study was carried out by using imputed WGS data, as previously described by Iso-Touru et al. [41] and Wu et al. [42]. A total of 4921 bulls were genotyped with the Illumina BovineSNP50 BeadChip (54 k) ver. 1 or 2 (Illumina, San Diego, CA, USA). The 54 k genotypes were imputed to WGS variants by a 2-step approach [43]. First, all animals were imputed to the high-density (HD) level by using a multibreed reference of 3383 animals (1222 Holsteins, 1326 Nordic Red Dairy Cattle, and 835 Danish Jerseys), which had been genotyped with the Illumina BovineHD BeadChip. Subsequently, these imputed HD genotypes were imputed to the WGS level by using a multibreed reference of 1228 animals from Run4 of the 1000 Bull Genomes Project [1] (1148 cattle, including 288 individuals from the global Holstein population, 56 Nordic Red Dairy Cattle, 61 Jerseys, and 743 cattle from other breeds [1] and additional data from Aarhus University (80 individuals, including 23 Holsteins, 30 Nordic Red Dairy Cattle, and 27 Danish Jerseys). Different variant calling pipelines were used for the 1000 Bull Genome Project data and the in-house Nordic data at Aarhus University. The whole genome sequence data at Aarhus University was analyzed as described by Brøndum et al. [44]; while the same for 1000 Bull Genome Project was described by Daetwyler et al. [1]. Detailed guidelines are available at http://www.1000bullgenomes.com. Data from both sources were available as VCF files. The data from the two sources were combined using Picards MergeVCFs (https://broadinstitute.github.io/picard/). As the 1000 Bull Genomes Project only shares data after variant calling, some markers were not called for all animals in the combined dataset. To avoid large gaps of missing markers in the dataset, only markers that were called in both the Nordic and the 1000 Bull Genomes Project datasets were kept. For positions containing both a SNP and an INDEL, the INDEL was discarded, as the imputation methods rely on unambiguous sequences of variants. Positions with disagreements between alleles for sequence and HD data were also deleted. Reference genotype probability data was run through BEAGLE [45] and all markers with an R2 value (squared correlation between the true and imputed allele dosages) below 0.9 were removed from the original sequence data. This was done in order to remove poorly imputed markers that might have adverse effects on the imputation procedures. Imputation from 54 k to HD genotypes to HD and imputation to the WGS level were undertaken with IMPUTE2 v2.3.1 [46] and Minimac2 [47], respectively. The imputation to whole genome sequence was done in chunks of 5 Mb with the length of buffer region of 0.25 Mb on either side. A total of 22,751,039 biallelic variants were present in the imputed sequence data. After excluding SNP with a minor allele frequency below 1% or with large deviation from Hardy–Weinberg proportions (P < 1.0− 6), 15,355,402 SNPs on 29 autosomes in Nordic Holstein cattle were retained for association analyses. The average accuracy (r2-values from Minimac2) was 0.85 for across breed imputation. Information on the distribution of imputation accuracy as a function of minor allele frequency has previously been published [42].

The methodology of multiple QTL detection

We developed an analysis approach to run the conditional GWAS analysis, similar to the GCTA-COJO approach in GCTA [11]. However, GCTA-COJO uses GWAS summary data while we have reanalyzed the data after fitting only the lead SNP(s) on a chromosome. Furthermore, we used imputed dosage data instead of number of copies of the reference allele. This takes account of inaccuracies in genotype imputation. We first performed a single SNP GWAS analysis using GCTA [11] for each chromosome as the first round. Then we ranked the SNP based on their –log10P value in the GWAS. The SNP with the largest –log10P value, the lead SNP, within each chromosome was identified. An experiment-wise 0.05 type I error rate after Bonferroni correction for 15,335,402 simultaneous tests corresponds to a threshold of –log10P ≈ 8.5. If the –log10P value of the lead SNP exceeded 8.5; we extracted the lead SNP’s genotype dosage, fitted it as a covariate, and scanned the whole chromosome again as the second round. If the result of second round detected another SNP with a –log10P value exceeding 8.5 and this SNP also was significant in the first round (–log10P > 8.5), we extracted the allele dosage of this SNP and fixed it as another covariate and scanned the chromosome in a third round. This same procedure was iterated until no additional SNP remained significant. The lead SNP in each round were collected to build a lead SNP list. Moreover, in each round solo SNP, that is, SNP with no other significant SNP within a 1 Mb region were removed. A boundary for each QTL peak was defined as follows: for each QTL, we scanned the 1 Mb region up- and down-stream of each lead SNP, if SNP –log10P value decreased by more than 3 units compared to the value at the leading SNP and the region is larger than 0.25 Mb we set this SNP as a boundary, otherwise we set ±0.25 Mb as the boundary. The list of candidate genes were generated from the closest annotated genome feature to the lead SNP list.

Testing the type I error rate using simulation data

We used simulated phenotype data to test whether our approach to detecting multiple QTL on a chromosome by incorporating previously identified QTL as covariates, inflates the type I error rates [12]. We selected a SNP randomly from the genome as a causative mutation (QTN) with a MAF (Minor Allele Frequency) between 0.05 and 0.10 and in Hardy Weinberg equilibrium. Ten additional SNP with different levels of LD (linkage disequilibrium, r2) with the simulated QTN were selected. These 10 SNPs have different r2 with the QTN as follows: one with 0.9–1, one with 0.8–0.9, one with 0.8–0.7, one with 0.7–0.6, one with 0.6–0.5, one with 0.5–0.4, one with 0.4–0.3, one with 0.3–0.2 and two with less than 0.2. Allele substitution effects at the QTL were sampled from a univariate normal distribution with mean of 20% of the standard deviation of phenotype and variance equal to 1% of the phenotypic variance. We repeated this simulation and applied our analysis 100 times. Lastly, we investigated how many times we found a SNP in LD with the simulated QTN after we fix the simulated causative mutation as a covariate i.e. false positive detection.

LD calculation and annotation

We calculated the pairwise r2 between lead SNP and all other SNPs on the same chromosome using PLINK [48] and extracted all the SNPs which have r2 > 0.2 with the lead SNP. All these SNPs were annotated by VEP (Variant Effect Predictor) [14]. To find the candidate genes, we extracted all the genes which overlap with LD regions of the lead SNP and searched these gene entries in the Mammalian Phenotype database [24]. We collected all the lead SNPs and calculated the pairwise r2 with SNPs in the chromosome. The boundary was set to the last SNP that has r2 > 0.2. Then we extracted all the genes overlapping these regions and searched them in the database. We found 411 genes located in the LD regions, of which 375 have gene symbols. These 375 genes were searched in the database and 364 have mutation lines with phenotype descriptions in the Mammalian Phenotype database. We refined results using two terms for phenotypes: ‘abnormal in mammary gland development’ and ‘abnormal in milk production’.

The genetics variants explained by QTL

We used the lead SNP list to generate the genetic relationship matrix (GRM) as group 1. Then we excluded all flank 2.5 Mb SNPs of the lead SNP from the imputed HD data to generate GRM as group 2. At last, we estimated variance explained by these two groups for each trait. The whole analysis was conducted using GCTA [11]. Figure S1. The locuszoom figure of previous report causative mutation of DGAT1 of the genome-wide association result in fat content of milk in Nordic Holstein cattle and Figure S2. The locuszoom figure of previous report causative mutation of DGAT1 of the genome-wide association result in fat content of milk in Nordic Holstein cattleBOP1 was not include in USCS refFlat file. (DOCX 551 kb)
  45 in total

1.  Weighting sequence variants based on their annotation increases power of whole-genome association studies.

Authors:  Gardar Sveinbjornsson; Anders Albrechtsen; Florian Zink; Sigurjón A Gudjonsson; Asmundur Oddson; Gísli Másson; Hilma Holm; Augustine Kong; Unnur Thorsteinsdottir; Patrick Sulem; Daniel F Gudbjartsson; Kari Stefansson
Journal:  Nat Genet       Date:  2016-02-08       Impact factor: 38.330

2.  Extensive long-range and nonsyntenic linkage disequilibrium in livestock populations: deconstruction of a conundrum.

Authors:  E Lipkin; K Straus; R Tal Stein; A Bagnato; F Schiavini; L Fontanesi; V Russo; I Medugorac; M Foerster; J Sölkner; M Dolezal; J F Medrano; A Friedmann; M Soller
Journal:  Genetics       Date:  2008-12-15       Impact factor: 4.562

3.  IKKalpha provides an essential link between RANK signaling and cyclin D1 expression during mammary gland development.

Authors:  Y Cao; G Bonizzi; T N Seagroves; F R Greten; R Johnson; E V Schmidt; M Karin
Journal:  Cell       Date:  2001-12-14       Impact factor: 41.582

4.  Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle.

Authors:  Hans D Daetwyler; Aurélien Capitan; Hubert Pausch; Paul Stothard; Rianne van Binsbergen; Rasmus F Brøndum; Xiaoping Liao; Anis Djari; Sabrina C Rodriguez; Cécile Grohs; Diane Esquerré; Olivier Bouchez; Marie-Noëlle Rossignol; Christophe Klopp; Dominique Rocha; Sébastien Fritz; André Eggen; Phil J Bowman; David Coote; Amanda J Chamberlain; Charlotte Anderson; Curt P VanTassell; Ina Hulsegge; Mike E Goddard; Bernt Guldbrandtsen; Mogens S Lund; Roel F Veerkamp; Didier A Boichard; Ruedi Fries; Ben J Hayes
Journal:  Nat Genet       Date:  2014-07-13       Impact factor: 38.330

5.  Genotype imputation with thousands of genomes.

Authors:  Bryan Howie; Jonathan Marchini; Matthew Stephens
Journal:  G3 (Bethesda)       Date:  2011-11-01       Impact factor: 3.154

6.  Multibreed genome wide association can improve precision of mapping causative variants underlying milk production in dairy cattle.

Authors:  Lesley-Ann Raven; Benjamin G Cocks; Ben J Hayes
Journal:  BMC Genomics       Date:  2014-01-24       Impact factor: 3.969

7.  Detailed phenotyping identifies genes with pleiotropic effects on body composition.

Authors:  Sunduimijid Bolormaa; Ben J Hayes; Julius H J van der Werf; David Pethick; Michael E Goddard; Hans D Daetwyler
Journal:  BMC Genomics       Date:  2016-03-12       Impact factor: 3.969

8.  Genomewide association studies and human disease.

Authors:  John Hardy; Andrew Singleton
Journal:  N Engl J Med       Date:  2009-04-15       Impact factor: 91.245

9.  Genetic support for a quantitative trait nucleotide in the ABCG2 gene affecting milk composition of dairy cattle.

Authors:  Hanne Gro Olsen; Heidi Nilsen; Ben Hayes; Paul R Berg; Morten Svendsen; Sigbjørn Lien; Theo Meuwissen
Journal:  BMC Genet       Date:  2007-06-21       Impact factor: 2.797

10.  The Ensembl Variant Effect Predictor.

Authors:  William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal:  Genome Biol       Date:  2016-06-06       Impact factor: 13.583

View more
  5 in total

1.  Retraction Note: Dissecting closely linked association signals in combination with the mammalian phenotype database can identify candidate genes in dairy cattle.

Authors:  Zexi Cai; Bernt Guldbrandtsen; Mogens Sandø Lund; Goutam Sahana
Journal:  BMC Genet       Date:  2018-12-11       Impact factor: 2.797

2.  Multi-population GWAS and enrichment analyses reveal novel genomic regions and promising candidate genes underlying bovine milk fatty acid composition.

Authors:  G Gebreyesus; A J Buitenhuis; N A Poulsen; M H P W Visker; Q Zhang; H J F van Valenberg; D Sun; H Bovenhuis
Journal:  BMC Genomics       Date:  2019-03-06       Impact factor: 3.969

3.  Genomewide Association Analyses of Lactation Persistency and Milk Production Traits in Holstein Cattle Based on Imputed Whole-Genome Sequence Data.

Authors:  Victor B Pedrosa; Flavio S Schenkel; Shi-Yi Chen; Hinayah R Oliveira; Theresa M Casey; Melkaye G Melka; Luiz F Brito
Journal:  Genes (Basel)       Date:  2021-11-19       Impact factor: 4.096

4.  Prioritizing candidate genes post-GWAS using multiple sources of data for mastitis resistance in dairy cattle.

Authors:  Zexi Cai; Bernt Guldbrandtsen; Mogens Sandø Lund; Goutam Sahana
Journal:  BMC Genomics       Date:  2018-09-06       Impact factor: 3.969

5.  Genomic measures of inbreeding coefficients and genome-wide scan for runs of homozygosity islands in Iranian river buffalo, Bubalus bubalis.

Authors:  Seyed Mohammad Ghoreishifar; Hossein Moradi-Shahrbabak; Mohammad Hossein Fallahi; Ali Jalil Sarghale; Mohammad Moradi-Shahrbabak; Rostam Abdollahi-Arpanahi; Majid Khansefid
Journal:  BMC Genet       Date:  2020-02-10       Impact factor: 2.797

  5 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.