Quantifying the contribution of sequence variants with regulatory and evolutionary significance to 34 bovine complex traits.

Ruidong Xiang^1,2, Irene van den Berg^3,2, Iona M MacLeod^2, Benjamin J Hayes^2,4, Claire P Prowse-Wilkins^3,2, Min Wang^2,5, Sunduimijid Bolormaa^2, Zhiqian Liu^2, Simone J Rochfort^2,5, Coralie M Reich^2, Brett A Mason^2, Christy J Vander Jagt^2, Hans D Daetwyler^2,5, Mogens S Lund^6, Amanda J Chamberlain^2, Michael E Goddard^3,2.

Abstract

Many genome variants shaping mammalian phenotype are hypothesized to regulate gene transcription and/or to be under selection. However, most of the evidence to support this hypothesis comes from human studies. Systematic evidence for regulatory and evolutionary signals contributing to complex traits in a different mammalian model is needed. Sequence variants associated with gene expression (expression quantitative trait loci [eQTLs]) and concentration of metabolites (metabolic quantitative trait loci [mQTLs]) and under histone-modification marks in several tissues were discovered from multiomics data of over 400 cattle. Variants under selection and evolutionary constraint were identified using genome databases of multiple species. These analyses defined 30 sets of variants, and for each set, we estimated the genetic variance the set explained across 34 complex traits in 11,923 bulls and 32,347 cows with 17,669,372 imputed variants. The per-variant trait heritability of these sets across traits was highly consistent (r > 0.9) between bulls and cows. Based on the per-variant heritability, conserved sites across 100 vertebrate species and mQTLs ranked the highest, followed by eQTLs, young variants, those under histone-modification marks, and selection signatures. From these results, we defined a Functional-And-Evolutionary Trait Heritability (FAETH) score indicating the functionality and predicted heritability of each variant. In an additional 7,551 cattle, the high FAETH-ranking variants had significantly increased genetic variances and genomic prediction accuracies in 3 production traits compared to the low FAETH-ranking variants. The FAETH framework combines the information of gene regulation, evolution, and trait heritability to rank variants, and the publicly available FAETH data provide a set of biological priors for cattle genomic selection worldwide.

Keywords: animal breeding; cattle; evolution; gene regulation; quantitative traits

Year: 2019 PMID: 31501319 PMCID: PMC6765237 DOI: 10.1073/pnas.1904159116

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

Understanding how mutations lead to phenotypic variation is a fundamental goal of genomics. With a few exceptions, complex traits with significance in evolution, medicine, and agriculture are determined by many mutations and environmental effects. Genome-wide association studies (GWASs) have been successful in finding associations between single-nucleotide polymorphisms (SNPs) and complex traits (1). Usually, there are many variants, each of small effect, which contribute to trait variation. Consequently, very large sample size is needed to find significant associations that explain most of the observed genetic variation. In humans, the sample size has reached over 1 million (2). To test the generality of the findings in humans, it is desirable to have another species with very large sample size, and cattle is a possible example. There are over 1.46 billion cattle worldwide (3), and millions are being genotyped or sequenced as well as phenotyped (4, 5). Cattle have been domesticated from 2 subspecies of the humpless taurine (Bos taurus) and humped zebu (Bos indicus), which diverged ∼0.5 million years ago from extinct wild aurochs (Bos primigenius) (6). The increasing amount of genomic data and an outbred genome make cattle the only comparable GWAS model to humans. In addition, cattle have a very different demographic history than humans. While humans went through an evolutionary bottleneck about 10,000 to 20,000 y ago and then expanded to a population of billions, cattle have declined in effective population size due to domestication and breed formation, leading to a different pattern of linkage disequilibrium (LD) to humans. Insights into the genome–phenome relationships from cattle provide a valuable addition to the knowledge for other mammals. The knowledge of cattle genomics is also of direct practical value as rearing cattle is a major agricultural industry worldwide. 
Despite the huge sample sizes used in human GWASs, identification of the causal variants for a complex trait is still difficult. This is due to the small effect size of most causal variants and the LD between variants. Consequently, there are usually many variants in high LD, any one of which could be the cause of the variation in phenotype. Prioritization of these variants can be aided by functional information on genomic sites. For instance, mutations that change an amino acid are more likely to affect phenotype than synonymous mutations. Many mutations affecting complex traits regulate gene transcription-related activities. This has been demonstrated in many studies of human genomics, including but not limited to the analysis of intermediate trait quantitative trait loci (QTLs), such as metabolic QTLs (mQTLs) (7) and expression QTLs (eQTLs) (8) and analysis of regulatory elements, such as promoters (9) and enhancers (10), which can be identified with chromatin immunoprecipitation sequencing (ChIP-seq). In animals, the Functional Annotation of Animal Genomes (FAANG) project has started (11), and animal functional data have been accumulating (12–14). However, it is unclear which types of functional information improve the identification of causal mutations. Mutations affecting complex traits may be subject to natural or artificial selection, which leaves a “signature” in the genome (15, 16). Given the unique evolutionary path of cattle, which has been significantly shaped by human domestication (17), it is attractive to test whether variants showing signatures of selection contribute to variation in complex traits. Mutations within genomic sites that are conserved across species may also affect complex traits. A previous study in humans showed that among a number of functional annotations, conserved sites across 29 mammals had the strongest enrichment of heritability in 17 complex traits (18). 
We aim to determine which of several possible indicators of function are most useful for predicting sequence variants that are most likely to affect 34 traits in B. taurus dairy cattle. The indicators considered fall into 3 groups: 1) functional annotations of the bovine genome based, for instance, on ChIP-seq experiments; 2) evolutionary data, such as a site being under selection; and 3) GWAS data from traits that are relatively close to the primary action of the mutation, such as gene expression. Using these indicators of function, we define 30 sets of variants and estimate the variance explained by each set across 34 traits in 44,270 cattle. We then combine the estimates of heritability per variant across traits and across functional and evolutionary categories to define a Functional-And-Evolutionary Trait Heritability (FAETH) score that ranks variants on variance explained in complex traits. We validate the FAETH score in an independent dataset of 7,551 Danish cattle. The FAETH score of over 17 million variants with detailed user instructions is publicly available at https://doi.org/10.26188/5c5617c01383b (19). A tutorial demonstrating the calculation of the FAETH score along with demo data and R scripts can be found at https://ruidongxiang.com/2019/07/19/calculation-of-faeth-score-2/.

Results

Analysis Overview.

Our approach was to estimate the trait variance explained by a set of variants defined by some external data, such as the mapping of the gene expression QTLs (geQTLs), RNA splicing QTLs (sQTLs), or genome annotation, for 34 traits measured in dairy cattle. Sequence variants available to this study included over 17 million SNPs and indels. Any large set of variants can explain almost all of the genetic variance due to the LD between surrounding and causal variants. Therefore, we fitted each externally defined set of variants in a model together with a standard set of 630,000 SNPs from the bovine high-density (HD) SNP array. We combined the results from all 34 traits and all sets of variants to derive a score for each variant based on its expected contribution to the genetic variance in these 34 traits and tested the validity of this score in an independent cattle dataset. Our analysis had 4 major steps (Fig. 1).

Fig. 1.

Overview of the analysis. The discovery analysis involved the selection of variants from functional and evolutionary datasets; this figure shows examples of some of the datasets used. In the test analysis, each of the variant sets was used to make GRMs. Then, each one was analyzed in the GREML (gGi), together with the high-density SNP chip GRM (gGHD), for each of the 34 traits (Yj). Once the heritability, h², of each gGi was calculated, it was averaged across traits and adjusted for the number of variants used to build the gGi to calculate the per-variant h². The FAETH score of each variant was derived from its memberships of the differently partitioned sets and the per-variant h² of those sets. In the validation analysis, variants with high and low FAETH ranking were tested in a Danish cattle dataset for GREML and genomic prediction of 3 production traits. The Australian test dataset contained 9,739 bulls and 22,899 cows of Holstein breed, 2,059 bulls and 6,174 cows of Jersey, 2,850 cows of mixed breeds, and 125 bulls and 424 cows of Australian Red. The Danish reference set contained 4,911 Holstein, 957 Jersey, and 745 Danish Red bulls, and the Danish validation population contained 500 Holstein, 517 Jersey, and 192 Danish Red bulls.

The 17 million sequence variants (1000 Bull Genomes Run6) (20) were classified according to external information from the discovery analysis of the function and evolution of each genomic site. The basis for this classification was either publicly available data or our own data as described in . The genome was partitioned 15 different ways as listed in Table 1. For example, the category of geQTL partitioned the genome variants into a set of targeted variants with geQTL P value < 0.0001 and a set of remaining variants (i.e., the “rest” of the variants). Another partition, e.g., variant annotation, based on a publicly available annotation of the bovine genome, divided variants into several nonoverlapping sets, such as “intergenic,” “intron,” and “splice sites.”

Table 1.

Variant sets selected from functional and evolutionary partitions

Partitions	Targeted variant sets (no. of variants)	Animal no.
Gene expression QTLs	geQTLs with metaanalysis P < 1e−4 from blood and milk cells, liver, and muscle (110,200)	209
Exon expression QTLs	eeQTLs with metaanalysis P < 1e−4 from blood and milk cells, liver, and muscle (945,832)	209
Splicing QTLs	sQTLs with metaanalysis P < 1e−4 from blood and milk cells, liver, and muscle (1,112,324)	209
Allele specific expression QTLs	aseQTLs with metaanalysis P < 1e−4 from blood and milk cells (1,100,446)	112
Polar lipid metabolite QTLs	mQTLs with metaanalysis P < 1e−4 from 19 types of milk metabolites (5,365)	338
ChIP-seq peaks	Under H3K4Me3 and H3K27Ac peaks from liver, muscle, and mammary gland (1,166,795)	15
Variant annotation	Annotated as UTR (42,350), intergenic (11,869,145), gene end (1,007,214), intron (4,629,025), splice.sites (11,080), coding.related (105,969), and noncoding.related (4,589)	na
Predicted CTCF sites	Variants tagged by mapped CTCF-binding motifs from humans, mice, dogs, and macaques as published in ref. 32 (252,234)	na
HPRS	Genome sites within the top 1% gkm SVM score from the HPRS as published in ref. 31 (169,773)	na
Conserved 100 species	Bovine genome sites lifted over from human sites with PhastCon score (34) > 0.9 calculated using genomes of 100 vertebrate species (378,301)	na
Selection signature	GWAS P < 1e−4 between 7 beef and 8 dairy breeds, 1000 Bull Genomes (6,218)	1,370
Young variants	Ranked within the bottom 1% of the proportion of positive correlations (PPRR) with rare variants, 1000 Bull Genomes (893,986)	2,330
LD score quartiles	First quartile (4,417,033/4,416,205), second quartile (4,418,731/4,419,930), third quartile (4,415,633/4,415,481), and fourth quartile (4,417,975/4,417,756)	44,270
Variant density quartiles	First quartile (4,429,833), second quartile (4,414,996), third quartile (4,427,220), and fourth quartile (4,397,323)
MAF quartiles	First quartile (4,414,292/4,417,036), second quartile (4,421,093/4,417,428), third quartile (4,416,834/4,418,157), and fourth quartile (4,417,153/4,418,157)

For the 3 categories of quartiles, the numbers of variants on the left and right side of the slash were for the bulls and cows, respectively. LD score indicates the sum of linkage disequilibrium correlation between a variant and all variants in the surrounding 50-kb region, GCTA-LDS (38). The details of the variant annotations can be found in . The animal numbers are the sample size in each discovery analysis. Fourth quartile scores > third quartile > second quartile > first quartile. na, not applicable.

For each set of variants in each partition of the genome, separate genomic relationship matrices (GRMs) were calculated among the 11,923 bulls or 32,347 cows. Where a partition included only 2 sets (e.g., geQTL and the rest), a GRM was calculated only for the targeted set (e.g., geQTL). For each of the 34 traits, the variance explained by random effects described by each GRM was estimated using restricted maximum likelihood (this analysis is referred to as a genomic REML or GREML). Each GREML analysis fitted a random effect described by the targeted GRM and a random effect described by the GRM calculated from the HD SNP chip (632,002 SNPs). Each GREML analysis estimated the proportion of genetic variance, h², explained by the targeted GRM in each of the 34 decorrelated traits (Cholesky orthogonalization) (21) in each sex. The h² explained by each targeted set of variants was divided by the number of variants in the set to calculate the h² per variant, i.e., the per-variant h², and this was averaged for each variant across the 34 decorrelated traits. The FAETH score of all variants was calculated by averaging the per-variant h² across traits and informative partitions (13 out of 15). Two partitions determined as not informative were not included in the FAETH score computation. Variance explained and the accuracy of genomic predictions (using an independent dataset of 7,551 Danish cattle with 3 milk production traits) were compared between variants of high and low FAETH score.
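The GRM construction described above can be sketched in a few lines. This section does not spell out the GRM formula, so the sketch assumes the widely used GCTA-style standardization, in which each variant is centred by twice its allele frequency and scaled by its expected variance. The function name, the toy genotypes, and the split into a "targeted" and an "HD" set are hypothetical and only illustrate how two GRMs for one GREML model might be built.

```python
import numpy as np

def gcta_style_grm(genotypes):
    """Return an (animals x animals) GRM from an (animals x variants) 0/1/2 matrix."""
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                          # allele frequency of each variant
    keep = (p > 0.0) & (p < 1.0)                      # fixed variants carry no information
    x, p = x[:, keep], p[keep]
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # centre and scale each variant
    return z @ z.T / z.shape[1]                       # average over variants

# Toy data (hypothetical): 6 animals, 300 variants.
rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(6, 300))
targeted_idx = np.arange(0, 100)      # stands in for a targeted set such as geQTLs
hd_idx = np.arange(100, 300)          # stands in for the HD SNP chip variants
grm_targeted = gcta_style_grm(geno[:, targeted_idx])
grm_hd = gcta_style_grm(geno[:, hd_idx])
```

In a real analysis the two matrices would be passed to a REML solver as separate random effects; that step is not reproduced here.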

Characteristics of Variant Sets with Regulatory and Evolutionary Significance.

Based on the 15 partitions of the genome in Table 1, we defined 30 sets of variants. The details of the discovery analysis defining these sets can be found in . Briefly, regulatory variant sets including geQTLs, sQTLs, and allele-specific expression QTLs (aseQTLs) were discovered from multiple tissues, including white blood and milk cells, liver, and muscle. The milk cells were dominated by immune cells. However, they also contained mammary epithelial cells and had high transcriptomic similarity to the mammary gland tissue (13, 22). The polar lipid metabolites mQTLs were discovered using a multitrait metaanalysis (23) of 19 metabolite profiles, such as phosphatidylcholine, phosphatidylethanolamine, and phosphatidylserine (24), from bovine milk fat. The ChIP-seq data used in our analysis contained previously published H3K27Ac and H3K4me3 marks in liver and muscle tissues (25, 26) and newly generated H3K4Me3 marks from the mammary gland. Fig. 2 illustrates some of the properties of these variant sets. Many sQTLs with strong effects on the intron excision ratio (27) were discovered in a metaanalysis of sQTLs mapped in white blood and milk cells, liver, and muscle (13) (Fig. 2). Many significant aseQTLs were discovered using a gene-wise metaanalysis of the effects of the driver variant (dVariant) on the transcript variant (tVariant) at exonic heterozygous sites (28) from white blood and milk cells (Fig. 2). Fig. 2 shows that variants tagged by the marks of H3K4Me3, a marker for promoters, were closer to the transcription start site than other variants.

Fig. 2.

Examples of regulatory and evolutionary signals from the discovery analysis. (A) A Manhattan plot of the metaanalysis of sQTLs from white blood and milk cells and liver and muscle tissues. (B) A Manhattan plot of the metaanalysis of aseQTLs in the white blood cells. (C) A distribution density plot of variants tagged by H3K4Me3 ChIP-seq mark from mammary gland within 2 Mb of gene transcription start site. (D) Artificial selection signatures between 8 dairy and 7 beef cattle breeds with the linear mixed-model approach using the 1000 Bull Genomes database. The blue line indicates −log10(P value) = 4.

The variant annotation partition had 7 merged sets (Table 1) based on the Variant Effect Predictor of Ensembl (29) and NGS-SNP (30). Additional information on variant function annotation was obtained from the Human Projection of Regulatory Regions (HPRS) as published in ref. 31 and predicted CCCTC-binding factor (CTCF) sites as published in ref. 32. The evolutionary variant sets were discovered from across- and within-species genome analyses. Variants within cross-species conserved sites were lifted over from human genome sites (hg38) with a PhastCon score > 0.9 calculated using genome sequences of 100 vertebrate species. The LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver) rate from human conserved sites to bovine was 92.3%, which was higher than the LiftOver rate using the human sites with the PhastCon score > 0.9 across 29 mammalian species (33, 34). Detailed results of the analysis of conserved sites can be found in . The within-species evolutionary analysis used the whole-genome sequence variants from Run 6 of the 1000 Bull Genomes project (35). Those variants with higher frequency in dairy than in beef breeds (“selection signature”; Table 1 and Fig. 2) were detected from a GWAS where the breed type was modeled as a binary phenotype in the linear mixed model (36) of 15 beef and dairy breeds.
With the 1000 Bull Genomes data, we used a statistic, PPRR (the proportion of positive correlations [r] with rare variants), to identify variants possibly subject to recent artificial and/or natural selection. In a coalescence where a mutation has been positively selected, i.e., is relatively young and has increased in frequency rapidly, the selected mutation is seldom on the same branch as rare mutations, and so the LD r between the selected mutation and rare alleles is typically negative. This is similar to the logic employed by ref. 37. In this partition of the genome, the 1% of variants with the lowest PPRR, after correcting for the variants’ own allele frequency, were defined as young variants. The quartile categories partitioned the genome variants into 4 sets of similar size based on their LD score (sum of LD r² between a variant and all of the variants in the surrounding 50-kb region, GCTA-LDS) (38), the number of variants within a 50-kb window (variant density), or their minor allele frequency (MAF) (38) (Table 1). Note that the fourth quartile had the highest value, and the first quartile had the lowest value, for LD score, MAF, and variant density.
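To make the PPRR idea concrete, the following is a minimal sketch that counts, for one focal variant, the share of rare variants whose minor allele is positively correlated with the focal allele. Several details are assumptions rather than the authors' procedure: the rare-allele threshold (`rare_maf`), the use of every rare variant in the matrix rather than a defined window, and the function name. The allele-frequency correction and the bottom-1% cutoff described in the text are not implemented.

```python
import numpy as np

def pprr(genotypes, focal, rare_maf=0.01):
    """Share of rare variants whose minor allele correlates positively with the focal allele."""
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0
    maf = np.minimum(p, 1.0 - p)
    rare = (maf > 0.0) & (maf < rare_maf)        # rare_maf is a hypothetical threshold
    rare[focal] = False                          # never compare the focal variant with itself
    if not rare.any():
        return float("nan")
    # Recode rare variants so that the value is the count of the minor allele.
    minor = np.where(p[rare] > 0.5, 2.0 - x[:, rare], x[:, rare])
    f = x[:, focal] - x[:, focal].mean()
    cov = f @ (minor - minor.mean(axis=0))       # the sign of the covariance is the sign of r
    return float((cov > 0).mean())

# Toy genotypes (hypothetical): 100 animals, 1 focal variant and 3 rare variants.
toy = np.zeros((100, 4), dtype=int)
toy[:50, 0] = 2      # focal allele, carried by animals 0-49
toy[0, 1] = 1        # rare allele on a focal-allele carrier: positive r
toy[98, 2] = 1       # rare alleles on non-carriers: negative r
toy[99, 3] = 1
score = pprr(toy, focal=0)
```

Here one of the three rare variants is positively correlated with the focal allele, so the toy PPRR is one third; a young, rapidly rising allele would be expected to sit near the low end of this scale.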

The Proportion of the Genetic Variance for 34 Traits Explained by Each Set of Variants.

In the test datasets of 11,923 bulls and 32,347 cows, common variants (MAF ≥ 0.001) of the sets described above were used to make GRMs (36). Each of these GRMs was then fitted together with the high-density variant chip GRM (variant number = 632,002) in the GREML analysis to estimate the proportion of additive genetic variance explained by each functional and evolutionary set of variants, h², in each of the 34 decorrelated traits, separately in bulls and cows (Table 2). Overall, the ranking of the mean h² across the 34 traits was highly consistent between bulls and cows (r = 0.94). All of the mean h² estimates, except that of the intergenic variants, were higher for bull traits than cow traits, consistent with the higher heritability of phenotypic records in bulls than in cows (39) because bull phenotypes are actually the average of many daughter phenotypes of the bull. When the HD variants were fitted alone, they explained on average 17.8% (±2.7%) of the variance in bulls and 4.7% (±1.4%) in cows. The mean h² estimates of mQTLs and the conserved sites across 100 species (termed “conserved 100 species” in Table 2 and the following text) were much larger than their genome fractions in both sexes (Table 2). For other variant sets, the mean h² estimates generally increased with the number of variants in the set. For example, eQTLs, including exon expression QTLs (eeQTLs), sQTLs, and aseQTLs, each of which included around 5% of the total variants, explained 11 to ∼15% of trait variance in bulls and 2.5 to ∼4% of trait variance in cows. The young variants inferred by the statistic PPRR, which accounted for 0.54% of the total number of variants, explained 0.78% of the trait variance in bulls and 0.12% of the trait variance in cows.

Table 2.

The relative proportion of selected variant in sets compared to the total number of variants analyzed (genome fraction) and their averaged heritability in bulls and cows, across 34 traits

Category	Genome fraction, %	Mean h² in bulls, %	Mean h² in cows, %
eeQTLs	4.77	14.52 (2.2)	3.96 (1.2)
sQTLs	5.57	15.08 (2.5)	3.88 (1.2)
aseQTLs	5.21	11.0 (2.0)	2.47 (0.7)
mQTLs	0.03	0.71 (0.2)	0.12 (0.04)
geQTLs	0.53	1.54 (0.4)	0.19 (0.06)
ChIP-seq	6.60	4.21 (0.8)	0.90 (0.3)
Noncoding.related	0.03	0.06 (0.02)	0.013 (0.004)
Splice.sites	0.06	0.08 (0.02)	0.02 (0.005)
UTR	0.24	0.18 (0.03)	0.03 (0.01)
Coding.related	0.60	0.26 (0.06)	0.04 (0.012)
Geneend	5.70	3.76 (0.8)	0.80 (0.2)
Intron	26.2	5.56 (0.7)	1.53 (0.3)
Intergenic	67.2	10.3 (1.3)	17.3 (2.2)
Predicted CTCF sites	1.43	0.36 (0.08)	0.046 (0.02)
HPRS	0.96	0.31 (0.08)	0.045 (0.02)
Conserved 100 species	2.1	41.4 (2.6)	17.4 (2.3)
Selection signatures	0.02	0.011 (0.004)	0.002 (0.0008)
Young variants	0.54	0.78 (0.2)	0.12 (0.05)
LD score q1	25	4.57 (0.6)	1.18 (0.3)
LD score q2	25	5.56 (0.7)	1.45 (0.3)
LD score q3	25	6.38 (0.8)	1.75 (0.4)
LD score q4	25	6.94 (0.9)	2.01 (0.5)
Variant density q1	25	5.59 (0.7)	1.49 (0.3)
Variant density q2	25	5.42 (0.7)	1.45 (0.3)
Variant density q3	25	5.72 (0.7)	1.55 (0.3)
Variant density q4	25	5.99 (0.7)	1.65 (0.4)
MAF q1	25	1.36 (0.2)	0.35 (0.08)
MAF q2	25	11.5 (1.3)	3.51 (0.7)
MAF q3	25	29.2 (2.4)	10.3 (1.8)
MAF q4	25	40.5 (2.8)	15.6 (2.4)

SEs are in parentheses. q1 to q4 are the genome partitions based on the first, second, third, and fourth quartiles of MAF, LD score, and the number of variants (variant density) per 50-kb window. Fourth quartile > third quartile > second quartile > first quartile.

The mean h² increased greatly from MAF quartiles 1 to 4. However, the dramatically low estimates for the first MAF quartile may be associated with the reduced imputation accuracy for low-MAF variants. By contrast, the mean h² increased only slightly with LD score and even less with variant density. Estimates of the mean h² were divided by the number of variants in the set to calculate the per-variant h², allowing comparison of the genetic importance of variant sets made with varying numbers of variants. Since the per-variant h² was estimated independently in bulls and cows and yet showed high consistency between sexes, the average per-variant h² across sexes was used to rank each variant set (Fig. 3). Conserved 100 species and mQTLs made the top of the rankings (Fig. 3), due to their highly concentrated h² (41.4% in bulls and 17.4% in cows for conserved 100 species, and 0.71% in bulls and 0.12% in cows for mQTLs; Table 2) in a relatively small genome fraction (2.1% and 0.03%, respectively; Table 2). These 2 top sets were followed by several expression QTL sets, including eeQTLs, sQTLs, geQTLs, and aseQTLs (Fig. 3). Similar rankings were achieved by the “noncoding.related” set (0.03% of genome variants) that included variants annotated as “non_coding_transcript_exon_variant” and “mature_miRNA_variant”, the “splice.sites” set (0.06% of genome variants, including all of the variants annotated as associated with splicing functions), and the set of young variants (0.54% of genome variants).
The “UTR” set, which included variants annotated as within 3′ and 5′ untranslated regions of genes, and the “geneend” set, which included variants annotated as downstream and upstream of genes, both had modest rankings along with the ChIP-seq and selection signatures sets. The “coding.related” set, dominated by variants annotated as synonymous and missense, ranked higher than the top 1% HPRS, intergenic variants, and predicted CTCF sites. The intron set and the first-quartile MAF set had the lowest per-variant h².
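The per-variant ranking can be approximated directly from the published Table 2 values. The sketch below is a back-of-envelope check rather than the authors' computation: set sizes are approximated as the genome fraction times the 17,669,372 analyzed variants, the bull and cow estimates are averaged before dividing, and only 5 of the 30 sets are included. Because the genome fractions in Table 2 are rounded, close rankings (such as the top two sets) should be read with caution.

```python
import math

TOTAL_VARIANTS = 17_669_372

# (genome fraction %, mean h2 % in bulls, mean h2 % in cows), copied from Table 2
table2 = {
    "Conserved 100 species": (2.1, 41.4, 17.4),
    "mQTLs": (0.03, 0.71, 0.12),
    "sQTLs": (5.57, 15.08, 3.88),
    "Intron": (26.2, 5.56, 1.53),
    "MAF q1": (25.0, 1.36, 0.35),
}

def per_variant_h2(fraction_pct, h2_bulls_pct, h2_cows_pct):
    n_variants = TOTAL_VARIANTS * fraction_pct / 100.0       # approximate set size
    mean_h2 = (h2_bulls_pct + h2_cows_pct) / 2.0 / 100.0     # average across sexes, as a proportion
    return mean_h2 / n_variants

log10_per_variant = {name: math.log10(per_variant_h2(*row)) for name, row in table2.items()}
ranking = sorted(log10_per_variant, key=log10_per_variant.get, reverse=True)

for name in ranking:
    print(f"{name:<24s} log10(per-variant h2) = {log10_per_variant[name]:.2f}")
```

With these inputs the two small, high-h² sets come out on top and the intron and first-quartile MAF sets at the bottom, which matches the ordering described in the text.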

Fig. 3.

The proportion of genetic variances explained by sets of variants selected from functional and evolutionary categories. The ranking of variant sets based on the log10 scale of per-variant h², averaged across bulls (left error bar) and cows (right error bar).

The impact of MAF on the ranking of variant sets was examined by calculating, for each set, the per-variant h² expected from the number of variants in the set belonging to each MAF quartile. This MAF-expected per-variant h² was then subtracted from the observed per-variant h² to calculate the MAF-adjusted per-variant h². Excluding the sets based on MAF quartiles, the ranking of the unadjusted per-variant h² was well correlated (r = 0.9) with the ranking on the MAF-adjusted per-variant h². These results suggested an overall small impact of MAF on the variant set ranking of per-variant h².
Variants from sets highly ranked for per-variant h² were highlighted in important QTL regions with the multitrait GWAS results (Fig. 4). In the expanded region of beta-casein (CSN2), a major but complex QTL for milk protein due to the existence of multiple QTL with strong LD, different high-ranking variant sets tended to tag variants with strong effects from multiple locations (Fig. 4). Many variants with the strongest effects and close to CSN2 were tagged by sQTLs. Several clusters of variants from up- and downstream of CSN2 with slightly weaker effects were tagged by sets of ChIP-seq marks, young variants, and mQTLs. Conversely, for the expanded region of microsomal GST 1 (MGST1), a major QTL for milk fat, variants from high-ranking sets were more enriched in 2 major locations (Fig. 4). The top variant within the MGST1 gene was again an sQTL, confirming previous results that regulatory variants are enriched in this region (13). Although not enriched in the MGST1 peak region, conserved sites tagged many variants that were not tagged by other top sets. The young variant set appears to have tagged a different variant cluster around 0.7 Mb downstream from MGST1 (Fig. 4).
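The MAF adjustment described above amounts to a weighted average followed by a subtraction. The sketch below assumes that the expectation is a count-weighted mean of the per-variant h² of the four MAF quartile sets; the quartile values are derived from the bull column of Table 2 and are expressed in "% of variance per 1% of variants" rather than per single variant. The example set composition and observed value are hypothetical.

```python
# Per-variant h2 of MAF quartiles q1..q4 in bulls, as (mean h2 %) / (genome fraction %), from Table 2.
quartile_per_variant = [1.36 / 25, 11.5 / 25, 29.2 / 25, 40.5 / 25]

def maf_adjusted(observed_per_variant, counts_by_quartile):
    """Observed per-variant h2 minus the value expected from the set's MAF composition."""
    total = sum(counts_by_quartile)
    expected = sum(n * q for n, q in zip(counts_by_quartile, quartile_per_variant)) / total
    return observed_per_variant - expected

# Hypothetical set: 1,000 variants spread over the 4 MAF quartiles, observed value 2.0.
example_counts = [100, 200, 300, 400]
example_observed = 2.0
example_adjusted = maf_adjusted(example_observed, example_counts)
```

A set that behaves exactly like its MAF composition predicts gets an adjusted value of zero, so positive values indicate h² beyond what allele frequency alone would suggest.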

Fig. 4.

Examples of top-ranked variant sets in important bovine trait QTL. (A) Manhattan plot of the metaanalysis of GWAS of 34 traits in the ±2 Mb region surrounding the beta casein (CSN2) gene, a major QTL for milk protein yield. (B) Manhattan plot of the metaanalysis of GWAS of 34 traits in the ±1 Mb region of the microsomal GST 1 (MGST1) gene, a major QTL for milk fat yield. The dots are colored based on their set memberships. The black bar between the gray dots and the X-axis indicates the gene locations.

The FAETH Score of Sequence Variants.

To quantify the relative importance of variants using a combination of functionality, evolutionary significance, and trait heritability, a framework was introduced to score variants based on their memberships of the sets of variants. Each time the genome variants were partitioned into nonoverlapping sets, each variant was a member of only one set and was assigned the per-variant h² of that set. Therefore, all variants were assigned the same number (13 partitions) of per-variant h² values, and the average of these 13 values was calculated for each variant and called the FAETH score. A criterion of per-variant > per-variant was also imposed to determine whether the variant set was informative. This criterion determined that 2 variant sets (HPRS and predicted CTCF sites) were not informative, and they were not included in the FAETH scoring. The FAETH score of 17,669,372 sequence variants for their genetic contribution to complex traits has been made publicly available at https://doi.org/10.26188/5c5617c01383b (19). A tutorial on calculating FAETH scores once the h² estimates have been obtained can be found at https://ruidongxiang.com/2019/07/19/calculation-of-faeth-score-2/.
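The scoring rule in this paragraph (one per-variant h² per partition, then an average across partitions) can be illustrated with a small sketch. Everything numeric here is hypothetical: the three partitions, the set names, the per-variant h² values (including those for the "rest" sets), and the variant identifiers are made up for illustration, and the informativeness criterion is not applied. The authors' own R tutorial linked above remains the reference implementation.

```python
from statistics import mean

# Hypothetical per-variant h2 for every set of three partitions.
per_variant_h2 = {
    "geQTL": {"targeted": 3.0e-8, "rest": 1.0e-8},
    "annotation": {"UTR": 2.0e-8, "intron": 0.5e-8, "intergenic": 0.4e-8},
    "conserved": {"targeted": 8.0e-7, "rest": 0.2e-8},
}

# Each variant belongs to exactly one set within each partition.
membership = {
    "variant_a": {"geQTL": "targeted", "annotation": "UTR", "conserved": "rest"},
    "variant_b": {"geQTL": "rest", "annotation": "intergenic", "conserved": "targeted"},
    "variant_c": {"geQTL": "rest", "annotation": "intron", "conserved": "rest"},
}

def faeth_score(sets_by_partition):
    """Average the per-variant h2 of the variant's set over all partitions."""
    if sets_by_partition.keys() != per_variant_h2.keys():
        raise ValueError("each variant needs exactly one set in every partition")
    return mean(per_variant_h2[part][s] for part, s in sets_by_partition.items())

scores = {variant: faeth_score(sets) for variant, sets in membership.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because every variant receives the same number of values, the scores are directly comparable; in this toy case the variant in the conserved set ranks first even though it is intergenic and not a geQTL.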

Variants with High FAETH Score Have Consistent Effects.

In the above analyses, the effect of a variant was estimated across all breeds. However, it is possible to fit a nested model in which both the main effect and an effect of the variant nested within a breed are included. If a variant is causal or in high LD with a causal variant, we might expect the effect to be similar in all breeds, whereas if the variant is merely in LD with the causal variant, the effect might vary between breeds. Based on the FAETH score, the top 1/3 and bottom 1/3 ranked sequence variants in the Australian data were selected as “high” and “low” ranking variants, respectively. Fig. 5 shows the estimates of across-breed and within-breed variances for both high- and low-ranking variants. In both cases, the within-breed variance was small, but the high-ranking variants had a larger across-breed variance and a smaller within-breed variance than the low-ranking variants. This implied that the FAETH score identified variants with consistent phenotypic effects across breeds.
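Selecting the high- and low-ranking thirds is a simple ranking operation, sketched below. The function name and the toy scores are hypothetical, and how ties or a variant count not divisible by 3 were handled in the original analysis is not stated, so integer division is assumed here.

```python
import numpy as np

def split_high_low(scores):
    """Return index arrays of the top third and bottom third of the scores."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)                  # ascending order of score
    third = len(scores) // 3
    high = order[len(scores) - third:]          # highest-scoring third
    low = order[:third]                         # lowest-scoring third
    return high, low

toy_scores = np.array([0.9, 0.1, 0.5, 0.7, 0.2, 0.3])   # hypothetical FAETH scores
high_idx, low_idx = split_high_low(toy_scores)
```

The two index sets could then be used to build separate GRMs for the high- and low-ranking variants.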

Fig. 5.

Further tests of the variant FAETH score. (A) The heritability of high and low FAETH ranking variants for the multibreed GRM and the within-breed GRM (2 GRMs fitted together) estimated across 34 traits in the Australian data. The error bars are the SE of heritability calculated across 34 traits. (B) The heritability of high and low FAETH ranking variants for 3 additional traits to the 34 traits in the Australian data used to calculate the FAETH score. (C) The multibreed heritability of high and low FAETH variants for 3 production traits in Danish data. The error bars are the SEs of the heritability of each GREML analysis. (D) Prediction accuracy of gBLUP of 3 production traits in Danish data using high and low FAETH variants (averaged between bulls and cows). The genomic predictors were trained in multiple breeds and predicted into single breeds (HOL, Holstein; JER, Jersey). P values of significant difference based on Z-score test: •P < 0.1; **P < 0.01; ***P < 0.001; ****P < 0.0001. Note that for the prediction accuracy r, the significance of difference was based on the sample sizes of the Danish candidate subset where there were 500 Holstein, 517 Jersey, and 192 Danish Red.

Additional data were obtained to test the FAETH score. Table 3 highlights the FAETH annotation of several causal or putative causal mutations, all of which were categorized as high FAETH ranking. Fig. 5 shows that the high-ranking variants had significantly (Z-score test: P < 0.0001) higher heritability estimates than the low-ranking ones for fat yield, body length, and rump length (original traits, not the Cholesky-transformed traits), which were not part of the 34 Australian dairy traits used to calculate the FAETH score. Also, as a proof of concept, high FAETH-ranking variants were significantly enriched (P = 4.5e−35) for pleiotropic SNPs significantly associated with 32 traits in beef cattle containing B. taurus and B. indicus subspecies. The enrichment of the low FAETH-ranking variants in these significant beef cattle pleiotropic SNPs was not different from random. These results supported the generality of the FAETH variant ranking in different traits, breeds, and subspecies.

Table 3.

FAETH annotation of previously identified causal or putative causal mutations for dairy cattle complex traits using the top variant sets

Loci	Causal candidates	Annotation	Tagging variant sets	FAETH ranking
SLC37A1	Chr1:144377960 (45)	Intron	aseQTL	High
DGAT1	Chr14:1802266 (41)	Coding.related	mQTL, eeQTL, sQTL, aseQTL, ChIP-seq	High
FASN	Chr19:51386735 (71)	Intron	mQTL, eeQTL, sQTL, ChIP-seq	High
GHR	Chr20:31909478 (71)	Coding.related	Conserved 100 species	High

“High” means that the variant was ranked within the top 1/3 of the FAETH score.

Validation of the FAETH Score in Danish Cattle.

An independent dataset of 7,551 Danish cattle of multiple breeds was used to test the FAETH score. The Australian high- and low-ranking variants were mapped in the Danish data. In the GREML analysis of Danish data, the high-ranking variants had significantly higher heritability than the low-ranking variants across three production traits (Z-score test: P < 0.001 for protein yield and P < 0.0001 for fat and milk yield) (Fig. 5). The genomic best linear unbiased prediction (gBLUP) of Danish traits was also evaluated where the models were trained in the multiple-breed reference data to predict 3 production traits in each of 3 breeds (3 × 3 = 9 scenarios; Fig. 5). Out of these 9 scenarios, high-ranking variants had higher accuracies than the low-ranking variants in 8 scenarios. Based on the sample sizes of the Danish candidate subset (500 Holstein, 517 Jersey, and 192 Danish Red), the significance levels of the increase in prediction accuracy for the high-ranking variants for these 8 scenarios are specified in Fig. 5.
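The text reports a Z-score test for the difference in prediction accuracy (a correlation) between high- and low-ranking variants, based on the size of each candidate subset, but this section does not spell out the formula. One common way to compare two correlations is through Fisher's z-transformation, sketched below. Treating the two correlations as independent, the function name `compare_correlations`, and the example accuracies are all assumptions made for illustration and may differ from the authors' exact procedure.

```python
import math

def compare_correlations(r_high, r_low, n):
    """Two-sided Z test for the difference between two correlations.

    Uses Fisher's z-transformation and assumes both correlations were estimated
    on n animals each and can be treated as independent (a simplification).
    Returns the Z statistic and the two-sided P value.
    """
    z_high = math.atanh(r_high)          # Fisher transform of each correlation
    z_low = math.atanh(r_low)
    se = math.sqrt(2.0 / (n - 3))        # SE of the difference of two Fisher z values
    z = (z_high - z_low) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p

# Hypothetical accuracies for a candidate subset of 500 animals.
z_stat, p_value = compare_correlations(0.45, 0.35, n=500)
print(f"Z = {z_stat:.2f}, P = {p_value:.3f}")
```

The standard error shrinks only with the square root of the sample size, which is why a subset of 192 animals has far less power to detect the same difference in accuracy than a subset of 500.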

Discussion

GWASs have been very successful in finding variants associated with complex traits, but they have been less successful in identifying the causal variants because often there are a large group of variants, in high LD with each other (particularly in livestock) that are all associated with the trait. To distinguish among these variants, it would be useful to have information, external to the traits being analyzed, that points to variants that are likely to have an effect on phenotype. In this paper, we have evaluated 30 sources of external information based on genome annotation, evolutionary data, and intermediate traits such as gene expression and milk metabolites. Then, we assessed the variance that each set of variants explained when they were included in a statistical model that also included a constant set of 600,000 SNPs from the bovine HD SNP array. The purpose of this method is to find sets of variants that add to the variance explained by the HD SNPs, presumably because they are in higher LD with the causal variants than the HD SNPs are. Since the causal variants themselves are likely to be among the sequence variants analyzed, this method is a filter for classes of variants that are enriched for causal variants or variants in high LD with them. Although developed in cattle, the general framework of estimating FAETH score by combining the information of functionality, evolution, and complex trait heritability can be directly applied to other species. Additional tests of FAETH outside of the analyzed 34 traits and multiple beef cattle traits and the positive validation results in the Danish data support the across-breed, across-subspecies, and across-country usage of the FAETH score. Further, FAETH score not only contains a ranking of millions of variants that can be used as biological priors for genomic prediction (e.g., BayesRC) (40) but also includes the information of the variant membership to different functional and evolutionary categories. 
This additional information can be used by other researchers to annotate their variants of interest (e.g., Table 3).

Our results agreed with the report in humans (18) that the conserved sites had very strong enrichment of trait heritability. Interestingly, our analysis showed that genomic sites with conservation across a larger number of species appeared to have tagged variants with stronger enrichment of heritability, compared to the sites conserved across a smaller number of species. It may be worth studying the impact of the extent of the cross-species conservation on the amount of trait variation explained by the tagged variants in the future.

Our analysis also highlights the importance of intermediate trait QTL, including QTLs for metabolic traits and gene expression (mQTLs, geQTLs, eeQTLs, sQTLs, and aseQTLs). This is not a surprising result, as the significant contribution of different intermediate trait QTLs to complex trait variation has been reported in humans (7, 27, 41–43) and cattle (13, 44–46). An advantage of these intermediate traits over conventional phenotypes is that individual QTL explain a larger proportion of the variance. For instance, cis eQTL tend to have a large effect on gene expression. This increases the signal-to-noise ratio and so increases power to distinguish causal variants from variants in partial LD with them. However, an intermediate QTL mapping study requires substantial resources, especially when considering different metabolic profiles and tissues with large sample size. In the current analysis, we utilized several methods to combine results from individual studies of intermediate QTL mapping (21, 23, 28). This could reduce the noise from individual analyses, and this is likely to increase the chance of finding causal mutations.

To our knowledge, no study has systematically compared the genetic importance of mQTLs with eQTLs.
The high ranking of mQTLs over eQTLs in our study might be related to the fact that the mQTLs were discovered from the milk fat, and the analyzed phenotype in the test data contained several milk-production traits. However, out of the 5,365 chosen mQTL variants, 961 variants were from the ±2 Mb region of DGAT1, while no mQTLs were from chromosome 5, which harbors MGST1 (Fig. 4), both of which are known major milk fat QTL. This suggests that many variants from the mQTL set not only influence milk fat production but may have other functions, including contributing to variation in the general process of fat synthesis, which is active in many mammalian tissues. Several large-scale human studies have highlighted the importance of mQTLs in various complex traits (7, 47).

Consistent with previous studies in cattle and humans (13, 27, 43), sQTLs and the related eeQTLs ranked slightly higher than other eQTLs (Fig. 3). Cattle aseQTLs and geQTLs were found to have a similar magnitude of enrichment with trait QTL (28), and this is consistent with the current observation.

We proposed a method to identify variants that are young but at a moderate frequency and found this set was enriched for effects on quantitative traits (Figs. 3 and 4). However, Kemper et al. (48) showed that variants identified by selection signatures using traditional methods, such as fixation index (49) and integrated haplotype score (50), had little contribution to complex traits in cattle. In the current study, the selection signatures between beef and dairy cattle (“selection signature” set as shown in Table 1) explained some genetic variation in complex traits, although their contribution is relatively small (Table 2 and Fig. 3). It is possible that the inclusion of many nonproduction traits in the current study increased the chance of finding the trait-related sequence variants that are under artificial selection.
Also, the use of sequence variants in the current study may have increased power compared to the study conducted by Kemper et al. (48), which used HD chip variants. The set of variants with low PPRR (“young variants”) had a higher ranking of genetic importance to the complex traits than the other artificial selection signatures (Fig. 3). The identification of relatively young variants is based on the theory that very recent selection will increase the frequency of the favored alleles (37). Thus, the young variant set could contain variants that were either under artificial selection and/or recently appeared, and this may be the reason that it explained more trait variation than the artificial selection signatures. As shown in Fig. 4, many young variants can be found in major production trait QTL. Genome-regulatory elements such as enhancers and promoters are important regulators of gene expression, and they can be identified by ChIP-seq assays. In humans, ChIP-seq–tagged binding QTLs (bQTLs) showed significant enrichments in complex and disease traits (51). We did not have enough individuals with ChIP-seq data to identify bQTLs. However, with only a limited amount of ChIP-seq data, variants tagged by H3K4me3 ChIP-seq showed a closer distance to the transcription start sites (Fig. 2), and H3K4me3 and H3K27ac together tagged variants that had some contribution to complex trait variation (Fig. 3). Also, the FAETH ranking of the ChIP-seq–tagged variant set was similar to the ranking of variant annotation sets of gene end (variants within regions up- and downstream of genes) and UTR (variants within 3′ and 5′ UTR). It is logical that variants with the potential to affect promoters and/or enhancers are annotated as close to genes or located in gene-regulatory regions. The variant annotation sets of noncoding-related and splice sites ranked relatively high for their contribution to trait variation (Fig. 3). 
Previously, variants annotated as splice sites had a high ranking of genetic importance to cattle complex traits (52). The majority of the variants from the noncoding-related set are “non_coding_transcript_exon_variant”, which is “a sequence variant that changes noncoding exon sequence in a noncoding transcript” according to VEP (29). This group of variants can be associated with long noncoding RNAs, and they are found to contribute to complex traits in humans (53) and cattle (54). Variants annotated as coding-related, of which the majority are missense and synonymous, had a relatively low ranking of genetic importance to complex traits (Fig. 3). This seems a surprising result, but Koufariotis et al. (52) also reported similar observations in cattle. Perhaps coding variants that influence phenotype are subject to purifying selection and hence have low heterozygosity and hence low contribution to variance.

The contribution of variants with different LD properties to complex traits is an ongoing debate in humans (55–57). In our analysis of cattle, a domesticated species with strong LD between variants, variant LD differences had negligible influence on complex traits (Table 2). Also, variants within regions that have more variants (variant density) did not explain more trait variation. Common variants, as expected (58), had a substantial contribution to complex traits (Table 2 and Fig. 3).

Based on the variant membership to differentially partitioned genome sets and the value of the per-variant h², the FAETH score of sequence variants combined the information of evolutionary and functional significance and heritability estimates across multiple complex traits for each variant. This analytical framework provides a simple but effective and comprehensive ranking for each variant that entered the analysis.
Additional information on functional and/or evolutionary datasets can be easily integrated and linked to the variant contributions for multiple complex traits. A single score for each variant also makes the potential use of FAETH score easy and straightforward. For example, variants can be categorized as high and low FAETH ranking to create biological priors to inform Bayesian modeling for genomic selection (40). Additionally, different genome partitions of the variant sets in the FAETH data can be used to annotate interesting variants such as finding conserved sites that are also eQTLs. For example, we used FAETH data to annotate some causal or potential causal mutations for dairy cattle complex traits (Table 3). These results could improve our understanding of the biology behind the variant contribution to complex traits.

The FAETH score was further tested using Australian data. By building the within-breed GRM and comparing it with the multibreed GRM in the Australian data (Fig. 5) using a method proposed by Khansefid et al. (59), our analysis implied that the variants with the high FAETH ranking contained variants with consistent effects across different breeds. Although estimated using 34 traits, our results show that FAETH ranking of variants can distinguish informative and uninformative variants beyond these 34 traits (Fig. 5). Also, FAETH ranking of variants showed signs of being able to identify informative genetic markers for multiple traits in beef cattle including B. indicus subspecies. All of these results support the general use of FAETH variant scoring across different traits and breeds.

The FAETH score based on GREML using multiple Australian breeds was first tested with GREML using multiple Danish breeds (Fig. 5). In this test, variants with high FAETH ranking explained significantly more genetic variance in protein, fat, and milk yield than the low-ranking variants.
When the genomic predictors were trained in multiple Danish breeds and used to predict into single breeds (Fig. 5), significant increases in prediction accuracies for the high-FAETH variants were mostly seen in the Holstein breed, and the increases for the Jersey breed were not significant. Several reasons contributed to this, including the most noticeable fact that the Holstein breed, which is genetically distant from the Jersey (60, 61), dominated both Danish (Holstein: Jersey = 5:1) and Australian (Holstein: Jersey = 4:1) populations. The relatively small sample size of the Danish validation population (517 for Jersey and 192 for Danish Red) reduced the power of the Z-score test of significance of difference between correlations (i.e., prediction accuracies) of high- and low-FAETH variants. Also, since the Jersey breed has the smallest effective population size (62), it is expected that the advantage of a dense set of selected sequence variants is lowest (or absent) in that breed (63). Future tests in larger populations with increased breed diversity will provide a better evaluation of the performance of the FAETH-ranked variants in multibreed analyses. Increasing the breed diversity, sample size, and tissue types in the functional genomic data may also improve the genomic prediction performance of FAETH ranking in specific cattle breeds. Nevertheless, the test results of the FAETH score in additional dairy traits and in beef cattle GWASs support that the FAETH ranking can prioritize informative variants in different populations.

In humans, Finucane et al. (18) combined many sources of data to calculate a prior probability that a variant affects a phenotype. Our approach is different from theirs in some respects. They used GWAS summary data and stratified LD score regression, whereas we used raw data and GREML. They fitted all sources of information simultaneously, whereas we fitted one variant set at a time in competition with the HD variants.
We were unable to fit all sources at once with GREML for computational reasons but also because the extensive LD in cattle makes it harder than in humans to separate the effects of multiple variant sets. On the other hand, GREML is more powerful than LD score regression (64). Our study demonstrates that the increasing amount of genomic and phenotypic data makes the cattle model a robust and critical resource for testing genetic hypotheses for large mammals. A recent large-scale study for cattle stature also supports the general utility of the cattle model in GWASs (5). In the current study, we highlight the contribution of the variants associated with intermediate QTLs and noncoding RNAs to complex traits, and this is consistent with many observations in human studies (8, 9, 27). However, we also provide contrasting evidence to results from humans. We found LD property of variants (e.g., variants from genomic regions with high LD) had negligible influences on trait heritability, contrasting with the recent evidence for the strong influence of LD property on human complex traits (55). In addition, variants under artificial selection had limited contributions to bovine complex traits, while in humans (where artificial selection is absent), natural selection clearly operates on complex traits (65). While the reasons for these contrasting results are yet to be studied, our findings from cattle add valuable insights into the ongoing discussions of the genetics of complex traits. Our study has limitations. While some discovery analyses of the intermediate QTLs used relatively large sample size, the number of tissues and/or types of “omics” data included for discovering expression QTLs and mQTLs is yet to be increased. Also, in the discovery analysis, the selection criteria for informative variants to be included for building GRMs were relatively simple. 
In the test analysis, the heritability estimation for different GRMs used the GREML approach, which has been under some debate because of its potential bias (56, 66). Analysis of functional categories by the genomic feature models with BLUP has been previously tested (67), although this method can be computationally intensive. We aimed to treat each discovery dataset as equal as possible, and all GRMs were analyzed in the test dataset in the same systematic way. The positive results from the validation analysis suggest that informative variants have been well captured in the discovery and test analyses. The current version of FAETH score is based on included functional and evolutionary datasets. The FAETH score will be updated as more functional and evolutionary datasets become available.

Conclusions

We provide an extensive evaluation of the contribution of sequence variants with functional and evolutionary significance to multiple bovine complex traits. While developed using genomic and phenotypic data in the cattle model, the analytical approaches for the functional and evolutionary datasets and the FAETH framework of variant ranking can be applied equally well in other species. With their utility demonstrated, the publicly available FAETH score will provide functional and evolutionary annotation for sequence variants and effective and simple-to-implement biological priors for advanced genome-wide mapping and prediction.

Materials and Methods

Discovery Analysis.

Discovery data availability is detailed in the supplementary information. A total of 360 cows from a 3-y experiment at the Ellinbank research facility of Agriculture Victoria in Victoria, Australia, were used to generate RNA-seq and milk fat metabolite datasets. Animal use was approved by Agriculture Victoria Animal Ethics Committee Application 2013-23.

The previously published geQTLs, eeQTLs, and sQTLs in white blood and milk cells from a total of 131 Holstein and Jersey cows (13) were used. The geQTLs, eeQTLs, and sQTLs in liver and semitendinosus muscle samples from Angus steers were also used (13). The aseQTLs were discovered using RNA-seq data from white blood and milk cells in a total of 112 Holstein cows (5). The metaanalysis of these 4 types of eQTLs, using methods published in refs. 13 and 68, is detailed in the supplementary information.

The discovery of polar lipid metabolite mQTLs in bovine milk fat was based on the mass spectrometry-quantified concentration of 19 polar lipids from 338 Holstein cows. The lipid extraction description and the multitrait metaanalysis of single-trait GWASs (23) can be found in the supplementary information.

ChIP-seq marks indicative of enhancers and promoters were discovered from a combination of experimental and published datasets. ChIP-seq peak data of trimethylation at lysine 4 of histone 3 (H3K4me3) from 9 bovine muscle samples (26) and H3K4me3 and acetylation at lysine 27 of histone 3 (H3K27ac) from 4 bovine liver samples (25) were downloaded. The generation of mammary H3K4me3 ChIP-seq peaks from 2 lactating Holstein cows (collected with the approval of Agriculture Victoria Animal Ethics Committee Application 2014-23) is detailed in the supplementary information.

The discovery of variant sets with evolutionary significance was based on the whole-genome sequences of Run 6 of the 1000 Bull Genomes project (35). The analysis used a subset of 1,370 cattle of 15 dairy and beef breeds with a linear mixed-model method.
To fully utilize the 1000 Bull Genomes data, a metric, the proportion of positive correlations (r) with rare variants (PPRR; rare variants defined as MAF < 0.01), was developed to infer the variant age. PPRR was then calculated as

$$\mathrm{PPRR} = \frac{N_{k}(r > 0)}{N_{k}(r)} \quad \text{(Eq. 7)},$$

where $N_{k}(r > 0)$ was the count (N) of all of the positive correlations (r) between the genotypes of common variants and the genotypes of rare variants in a given window with a size of k (k = 50 kb for this study for computational efficiency), and $N_{k}(r)$ was the count of all correlations regardless of the sign. The calculation of r can be easily and effectively performed using plink1.9 (www.cog-genomics.org/plink/1.9/). The rationale of PPRR computation is detailed in the supplementary information.

Conserved genome sites in cattle were based on the lifted over (https://genome.ucsc.edu/cgi-bin/hgLiftOver) human sites with PhastCons score (34) > 0.9 computed across 100 vertebrate species. The analysis is detailed in the supplementary information. The variant annotation category was based on Ensembl Variant Effect Predictor (29) and NGS-SNP (30). Several variant annotations were merged from the original annotations to achieve reasonable sizes for GREML. The gkm SVM score of predicted regulatory potential for bovine genome sites was obtained from the HPRS (31). Variants in our study that overlapped with HPRS and were within the top 1% of the SVM score (169,773 variants) were selected. The predicted CTCF sites were obtained from Wang et al. (32), and variants that overlapped these predicted bovine CTCF sites were selected (252,234 variants).

Variant sets based on their distribution of LD score, density, and MAF were created using the GCTA-LDS method (38) based on imputed genome sequences of the test dataset of 11,923 bulls and 32,347 cows (detailed below). Over 17.6 million genome variants were partitioned into 4 quartiles of LD score per region (region size = 50 kb), the number of variants per window (window size = 50 kb), and MAF sets of variants that were used to make GRMs. The quartile partitioning of sequence variants followed the default setting of the GCTA-LDS. As a byproduct of GCTA LD score calculation, the number of variants per 50 kb window was computed, and the quartiles of the value of variant number per region for each variant were used to generate the variant density sets.
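The PPRR metric described earlier in this subsection is a simple ratio of counts, and a small Python sketch may make the counting explicit. It is not the authors' implementation, which relied on plink1.9 for the correlations. The function name `pprr`, the recoding of genotypes to minor-allele counts, and the use of "within `window` bp of the focal variant" as the window are assumptions made for illustration.

```python
import numpy as np

def pprr(genotypes, positions, maf_rare=0.01, window=50_000):
    """Proportion of positive correlations with nearby rare variants, per common variant.

    genotypes : (n_animals, n_variants) allele dosages coded 0/1/2.
    positions : base-pair position of each variant (same chromosome assumed).
    Returns an array with one PPRR per variant; NaN for rare variants and for
    common variants with no rare variant in the window.
    """
    g = np.asarray(genotypes, dtype=float)
    positions = np.asarray(positions)
    freq = g.mean(axis=0) / 2.0
    # Recode so that every column counts the minor allele (sketch assumption).
    flip = freq > 0.5
    g[:, flip] = 2.0 - g[:, flip]
    maf = np.minimum(freq, 1.0 - freq)
    rare = (maf > 0) & (maf < maf_rare)
    common = maf >= maf_rare
    out = np.full(g.shape[1], np.nan)
    for j in np.flatnonzero(common):
        near = rare & (np.abs(positions - positions[j]) <= window)
        idx = np.flatnonzero(near)
        if idx.size == 0:
            continue
        r = np.array([np.corrcoef(g[:, j], g[:, i])[0, 1] for i in idx])
        out[j] = np.mean(r > 0)   # count of positive r divided by count of all r
    return out

# Toy data: 200 animals, 1 common variant followed by 4 rare variants.
n = 200
geno = np.zeros((n, 5))
geno[:40, 0] = 1          # common variant, MAF = 0.1
geno[[0, 1], 1] = 1       # rare, carried by carriers of the common allele
geno[[100, 101], 2] = 1   # rare, carried by non-carriers
geno[[2, 3], 3] = 1       # rare, carried by carriers
geno[[4, 5], 4] = 1       # rare, but outside the 50 kb window
pos = np.array([1_000, 2_000, 3_000, 4_000, 1_000_000])
pprr_values = pprr(geno, pos)
print(pprr_values[0])
```

In the toy data two of the three rare variants inside the window correlate positively with the common variant, so its PPRR is two thirds.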

Test Analysis.

The test analysis with Australian data, including the model, is detailed in the supplementary information. Briefly, a total of 11,923 bulls (data provided by DataGene, http://www.datagene.com.au/ and CRV, https://www.crv4all-international.com/) and 32,347 cows (only provided by DataGene) from Holstein (9,739 ♂/22,899 ♀), Jersey (2,059 ♂/6,174 ♀), mixed breed (0 ♂/2,850 ♀) and Red dairy breeds (125 ♂/424 ♀) with 34 phenotypic traits (trait deviations for cows and daughter trait deviations for bulls [20]) were used. The trait decorrelation followed the procedure of Cholesky factorization (21). A total of 17,669,372 imputed sequence variants with Minimac3 imputation accuracy (69) R² > 0.4 were used as genotype data. The construction of GRM used GCTA (36) and the heritability analysis with 2-GRM REML used MTG2 (70). An online tutorial for calculating FAETH score after the heritability estimation is available at https://ruidongxiang.com/2019/07/19/calculation-of-faeth-score-2/.
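The idea behind decorrelating traits with a Cholesky factorization can be sketched in a few lines of Python. This is an illustration only: it uses the sample covariance of complete phenotype records, whereas the procedure in ref. 21 may use a different covariance matrix and handle missing records. The function name `cholesky_decorrelate` and the simulated data are hypothetical.

```python
import numpy as np

def cholesky_decorrelate(traits):
    """Transform correlated traits into uncorrelated, unit-variance traits.

    traits : (n_animals, n_traits) array of phenotypes with no missing values.
    With C = L L^T the Cholesky factorization of the trait covariance matrix,
    each animal's centered record y is mapped to L^{-1} y.
    """
    traits = np.asarray(traits, dtype=float)
    centered = traits - traits.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    lower = np.linalg.cholesky(cov)            # cov = lower @ lower.T
    # Solve lower @ t = y for every animal instead of forming the inverse.
    return np.linalg.solve(lower, centered.T).T

# Simulated example: 500 animals, 3 correlated traits.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 1.0]])
raw_traits = rng.normal(size=(500, 3)) @ mixing
transformed = cholesky_decorrelate(raw_traits)
print(np.round(np.cov(transformed, rowvar=False), 3))
```

Because the Cholesky factor is triangular, the first transformed trait is just a rescaled copy of the first original trait, and each later trait is adjusted only for the traits before it, so the trait order affects the interpretation.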

Validation Analysis.

The validation used variants within the top 1/3 (high) and bottom 1/3 (low) ranking from the Australian analysis to make GRMs in a total of 7,551 Danish bulls of Holstein (5,411), Jersey (1,203), and Danish Red (937), with a total of 8,949,635 imputed sequence variants in common between the Danish and Australian datasets, with a MAF ≥ 0.002 and imputation accuracy measured by the info score provided by IMPUTE2 ≥ 0.9 in the Danish data (62). Deregressed proofs (DRPs) were available for all animals in the Danish dataset for milk, fat, and protein yield. The Danish dataset was divided into a reference and a candidate (validation) set, where the reference set included 4,911 Holstein, 957 Jersey, and 745 Danish Red bulls, and the candidate set included 500 Holstein, 517 Jersey, and 192 Danish Red bulls. Over 1.25 million high-ranking variants and over 1.25 million low-ranking variants were used to make the high- and low-ranking GRMs. For the individuals in the reference set, each trait of protein, milk, and fat yield was analyzed with the GREML model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e} \quad \text{(Eq. 10)}$$

using GCTA (36), where $\mathbf{y}$ was the vector of DRP of analyzed Danish individuals; $\boldsymbol{\beta}$ was the vector of fixed effects (breeds); $\mathbf{X}$ was a design matrix relating phenotypes to their fixed effects; $\mathbf{u}$ was the vector of animal effects where $\mathbf{u} \sim N(\mathbf{0}, \mathbf{G}\sigma_g^2)$, $\mathbf{G}$ was the genomic relationship matrix between Danish individuals, $\mathbf{Z}$ was the incidence matrix, and $\mathbf{e}$ was the vector of residuals. This allowed the estimation of h² of high- and low-ranking variants in the Danish data. To test the variant ranking, genomic prediction with gBLUP was performed by dividing the Danish individuals into reference and validation datasets. The --blup-snp option in GCTA (36) was used to obtain variant effects from the GREML analyses, which were used to predict genomic estimated breeding value (GEBV) in the validation population. Prediction accuracies were computed for each of the breeds in the validation population, as the correlation between GEBV and DRP.
More tests of the FAETH score using additional Australian dairy and beef cattle data are detailed in the supplementary information.
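To make the validation workflow concrete, the following Python sketch builds a genomic relationship matrix, predicts GEBV for validation animals from a reference set, and reports accuracy as the correlation between GEBV and phenotype. It is a simplified stand-in for the GCTA analysis, not a reproduction of it: the variance ratio is supplied rather than estimated by REML, the only fixed effect is an overall mean (the paper fitted breed), and GEBV are computed directly from the GRM rather than from back-solved variant effects. The function names and the simulated data are hypothetical.

```python
import numpy as np

def grm(dosages):
    """Genomic relationship matrix from (n_animals, n_variants) allele dosages.

    Each variant is centered by twice its allele frequency and scaled by
    sqrt(2p(1-p)); the GRM is the average cross-product over variants.
    """
    x = np.asarray(dosages, dtype=float)
    p = x.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)                 # drop monomorphic variants
    x, p = x[:, keep], p[keep]
    w = (x - 2 * p) / np.sqrt(2 * p * (1 - p))
    return w @ w.T / w.shape[1]

def gblup_accuracy(g, y, ref, val, h2):
    """Predict validation GEBV from reference animals and correlate with y[val]."""
    lam = (1.0 - h2) / h2                    # residual-to-genetic variance ratio
    mu = y[ref].mean()                       # overall mean as the only fixed effect
    g_rr = g[np.ix_(ref, ref)]
    g_vr = g[np.ix_(val, ref)]
    gebv = g_vr @ np.linalg.solve(g_rr + lam * np.eye(len(ref)), y[ref] - mu)
    return np.corrcoef(gebv, y[val])[0, 1]

# Simulated example: 400 animals, 50 variants, trait heritability 0.8.
rng = np.random.default_rng(1)
n_animals, n_variants, h2_true = 400, 50, 0.8
freqs = rng.uniform(0.1, 0.5, n_variants)
dosages = rng.binomial(2, freqs, size=(n_animals, n_variants))
breeding_value = dosages @ rng.normal(size=n_variants)
noise_sd = np.sqrt(breeding_value.var() * (1 - h2_true) / h2_true)
phenotype = breeding_value + rng.normal(scale=noise_sd, size=n_animals)

g_matrix = grm(dosages)
accuracy = gblup_accuracy(g_matrix, phenotype,
                          ref=np.arange(300), val=np.arange(300, 400), h2=h2_true)
print(f"prediction accuracy: {accuracy:.2f}")
```

To mimic the comparison in the paper, one would build one GRM from the high-ranking variants and another from the low-ranking variants and compare the two accuracies within each breed.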

68 in total

1. Meta-analysis of genome-wide association studies for cattle stature identifies common genes that regulate body size in mammals.

Authors: Aniek C Bouwman; Hans D Daetwyler; Amanda J Chamberlain; Carla Hurtado Ponce; Mehdi Sargolzaei; Flavio S Schenkel; Goutam Sahana; Armelle Govignon-Gion; Simon Boitard; Marlies Dolezal; Hubert Pausch; Rasmus F Brøndum; Phil J Bowman; Bo Thomsen; Bernt Guldbrandtsen; Mogens S Lund; Bertrand Servin; Dorian J Garrick; James Reecy; Johanna Vilkki; Alessandro Bagnato; Min Wang; Jesse L Hoff; Robert D Schnabel; Jeremy F Taylor; Anna A E Vinkhuyzen; Frank Panitz; Christian Bendixen; Lars-Erik Holm; Birgit Gredler; Chris Hozé; Mekki Boussaha; Marie-Pierre Sanchez; Dominique Rocha; Aurelien Capitan; Thierry Tribout; Anne Barbat; Pascal Croiseau; Cord Drögemüller; Vidhya Jagannathan; Christy Vander Jagt; John J Crowley; Anna Bieber; Deirdre C Purfield; Donagh P Berry; Reiner Emmerling; Kay-Uwe Götz; Mirjam Frischknecht; Ingolf Russ; Johann Sölkner; Curtis P Van Tassell; Ruedi Fries; Paul Stothard; Roel F Veerkamp; Didier Boichard; Mike E Goddard; Ben J Hayes
Journal: Nat Genet Date: 2018-02-19 Impact factor: 38.330

2. Estimation of Genetic Correlation via Linkage Disequilibrium Score Regression and Genomic Restricted Maximum Likelihood.

Authors: Guiyan Ni; Gerhard Moser; Naomi R Wray; S Hong Lee
Journal: Am J Hum Genet Date: 2018-05-10 Impact factor: 11.025

3. Gene expression analysis of blood, liver, and muscle in cattle divergently selected for high and low residual feed intake.

Authors: M Khansefid; C A Millen; Y Chen; J E Pryce; A J Chamberlain; C J Vander Jagt; C Gondro; M E Goddard
Journal: J Anim Sci Date: 2017-11 Impact factor: 3.159

4. Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds.

Authors: Richard A Gibbs; Jeremy F Taylor; Curtis P Van Tassell; William Barendse; Kellye A Eversole; Clare A Gill; Ronnie D Green; Debora L Hamernik; Steven M Kappes; Sigbjørn Lien; Lakshmi K Matukumalli; John C McEwan; Lynne V Nazareth; Robert D Schnabel; George M Weinstock; David A Wheeler; Paolo Ajmone-Marsan; Paul J Boettcher; Alexandre R Caetano; Jose Fernando Garcia; Olivier Hanotte; Paola Mariani; Loren C Skow; Tad S Sonstegard; John L Williams; Boubacar Diallo; Lemecha Hailemariam; Mario L Martinez; Chris A Morris; Luiz O C Silva; Richard J Spelman; Woudyalew Mulatu; Keyan Zhao; Colette A Abbey; Morris Agaba; Flábio R Araujo; Rowan J Bunch; James Burton; Chiara Gorni; Hanotte Olivier; Blair E Harrison; Bill Luff; Marco A Machado; Joel Mwakaya; Graham Plastow; Warren Sim; Timothy Smith; Merle B Thomas; Alessio Valentini; Paul Williams; James Womack; John A Woolliams; Yue Liu; Xiang Qin; Kim C Worley; Chuan Gao; Huaiyang Jiang; Stephen S Moore; Yanru Ren; Xing-Zhi Song; Carlos D Bustamante; Ryan D Hernandez; Donna M Muzny; Shobha Patil; Anthony San Lucas; Qing Fu; Matthew P Kent; Richard Vega; Aruna Matukumalli; Sean McWilliam; Gert Sclep; Katarzyna Bryc; Jungwoo Choi; Hong Gao; John J Grefenstette; Brenda Murdoch; Alessandra Stella; Rafael Villa-Angulo; Mark Wright; Jan Aerts; Oliver Jann; Riccardo Negrini; Mike E Goddard; Ben J Hayes; Daniel G Bradley; Marcos Barbosa da Silva; Lilian P L Lau; George E Liu; David J Lynn; Francesca Panzitta; Ken G Dodds
Journal: Science Date: 2009-04-24 Impact factor: 47.728

Review 5. 10 Years of GWAS Discovery: Biology, Function, and Translation.

Authors: Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang
Journal: Am J Hum Genet Date: 2017-07-06 Impact factor: 11.025

6. A multi-trait, meta-analysis for detecting pleiotropic polymorphisms for stature, fatness and reproduction in beef cattle.

Authors: Sunduimijid Bolormaa; Jennie E Pryce; Antonio Reverter; Yuandan Zhang; William Barendse; Kathryn Kemper; Bruce Tier; Keith Savin; Ben J Hayes; Michael E Goddard
Journal: PLoS Genet Date: 2014-03-27 Impact factor: 5.917

7. MTG2: an efficient algorithm for multivariate linear mixed model analysis based on genomic information.

Authors: S H Lee; J H J van der Werf
Journal: Bioinformatics Date: 2016-01-10 Impact factor: 6.937

8. Whole-exome sequencing identifies common and rare variant metabolic QTLs in a Middle Eastern population.

Authors: Noha A Yousri; Khalid A Fakhro; Amal Robay; Juan L Rodriguez-Flores; Robert P Mohney; Hassina Zeriri; Tala Odeh; Sara Abdul Kader; Eman K Aldous; Gaurav Thareja; Manish Kumar; Alya Al-Shakaki; Omar M Chidiac; Yasmin A Mohamoud; Jason G Mezey; Joel A Malek; Ronald G Crystal; Karsten Suhre
Journal: Nat Commun Date: 2018-01-23 Impact factor: 14.919

9. Putative bovine topological association domains and CTCF binding motifs can reduce the search space for causative regulatory variants of complex traits.

Authors: Min Wang; Timothy P Hancock; Amanda J Chamberlain; Christy J Vander Jagt; Jennie E Pryce; Benjamin G Cocks; Mike E Goddard; Benjamin J Hayes
Journal: BMC Genomics Date: 2018-05-24 Impact factor: 3.969

10. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index.

Authors: Jian Yang; Andrew Bakshi; Zhihong Zhu; Gibran Hemani; Anna A E Vinkhuyzen; Sang Hong Lee; Matthew R Robinson; John R B Perry; Ilja M Nolte; Jana V van Vliet-Ostaptchouk; Harold Snieder; Tonu Esko; Lili Milani; Reedik Mägi; Andres Metspalu; Anders Hamsten; Patrik K E Magnusson; Nancy L Pedersen; Erik Ingelsson; Nicole Soranzo; Matthew C Keller; Naomi R Wray; Michael E Goddard; Peter M Visscher
Journal: Nat Genet Date: 2015-08-31 Impact factor: 38.330

34 in total

1. A multi-tissue atlas of regulatory variants in cattle.

Authors: Shuli Liu; Yahui Gao; Oriol Canela-Xandri; Sheng Wang; Ying Yu; Wentao Cai; Bingjie Li; Ruidong Xiang; Amanda J Chamberlain; Erola Pairo-Castineira; Kenton D'Mellow; Konrad Rawlik; Charley Xia; Yuelin Yao; Pau Navarro; Dominique Rocha; Xiujin Li; Ze Yan; Congjun Li; Benjamin D Rosen; Curtis P Van Tassell; Paul M Vanraden; Shengli Zhang; Li Ma; John B Cole; George E Liu; Albert Tenesa; Lingzhao Fang
Journal: Nat Genet Date: 2022-08-11 Impact factor: 41.307

2. Rare and population-specific functional variation across pig lines.

Authors: Roger Ros-Freixedes; Bruno D Valente; Ching-Yi Chen; William O Herring; Gregor Gorjanc; John M Hickey; Martin Johnsson
Journal: Genet Sel Evol Date: 2022-06-03 Impact factor: 5.100

3. Improving Genomic Prediction Using High-Dimensional Secondary Phenotypes. (Review)

Authors: Bader Arouisse; Tom P J M Theeuwen; Fred A van Eeuwijk; Willem Kruijer
Journal: Front Genet Date: 2021-05-24 Impact factor: 4.599

4. Use of Large and Diverse Datasets for ¹H NMR Serum Metabolic Profiling of Early Lactation Dairy Cows.

Authors: Timothy D W Luke; Jennie E Pryce; Aaron C Elkins; William J Wales; Simone J Rochfort
Journal: Metabolites Date: 2020-04-30

5. Meta-analysis for milk fat and protein percentage using imputed sequence variant genotypes in 94,321 cattle from eight cattle breeds.

Authors: Irene van den Berg; Ruidong Xiang; Janez Jenko; Hubert Pausch; Mekki Boussaha; Chris Schrooten; Thierry Tribout; Arne B Gjuvsland; Didier Boichard; Øyvind Nordbø; Marie-Pierre Sanchez; Mike E Goddard
Journal: Genet Sel Evol Date: 2020-07-07 Impact factor: 4.297

6. Epigenomics and genotype-phenotype association analyses reveal conserved genetic architecture of complex traits in cattle and human.

Authors: Shuli Liu; Ying Yu; Shengli Zhang; John B Cole; Albert Tenesa; Ting Wang; Tara G McDaneld; Li Ma; George E Liu; Lingzhao Fang
Journal: BMC Biol Date: 2020-07-03 Impact factor: 7.431

7. Accelerated deciphering of the genetic architecture of agricultural economic traits in pigs using a low-coverage whole-genome sequencing strategy.

Authors: Ruifei Yang; Xiaoli Guo; Di Zhu; Cheng Tan; Cheng Bian; Jiangli Ren; Zhuolin Huang; Yiqiang Zhao; Gengyuan Cai; Dewu Liu; Zhenfang Wu; Yuzhe Wang; Ning Li; Xiaoxiang Hu
Journal: Gigascience Date: 2021-07-20 Impact factor: 6.524

8. Characterizing Genetic Regulatory Elements in Ovine Tissues.

Authors: Kimberly M Davenport; Alisha T Massa; Suraj Bhattarai; Stephanie D McKay; Michelle R Mousel; Maria K Herndon; Stephen N White; Noelle E Cockett; Timothy P L Smith; Brenda M Murdoch
Journal: Front Genet Date: 2021-05-20 Impact factor: 4.599

9. A conditional multi-trait sequence GWAS discovers pleiotropic candidate genes and variants for sheep wool, skin wrinkle and breech cover traits.

Authors: Sunduimijid Bolormaa; Andrew A Swan; Paul Stothard; Majid Khansefid; Nasir Moghaddar; Naomi Duijvesteijn; Julius H J van der Werf; Hans D Daetwyler; Iona M MacLeod
Journal: Genet Sel Evol Date: 2021-07-08 Impact factor: 4.297

10. Conserved noncoding sequences provide insights into regulatory sequence and loss of gene expression in maize.

Authors: Baoxing Song; Edward S Buckler; Hai Wang; Yaoyao Wu; Evan Rees; Elizabeth A Kellogg; Daniel J Gates; Merritt Khaipho-Burch; Peter J Bradbury; Jeffrey Ross-Ibarra; Matthew B Hufford; M Cinta Romay
Journal: Genome Res Date: 2021-05-27 Impact factor: 9.043