I M MacLeod, P J Bowman, C J Vander Jagt, M Haile-Mariam, K E Kemper, A J Chamberlain, C Schrooten, B J Hayes, M E Goddard.
Abstract
BACKGROUND: Dense SNP genotypes are often combined with complex trait phenotypes to map causal variants, study genetic architecture and provide genomic predictions for individuals with genotypes but no phenotype. A single method of analysis that jointly fits all genotypes in a Bayesian mixture model (BayesR) has been shown to competitively address all three purposes simultaneously. However, BayesR and other similar methods ignore prior biological knowledge and assume all genotypes are equally likely to affect the trait. While this assumption is reasonable for SNP array genotypes, it is less sensible if the genotypes are whole-genome sequence variants, which should include the causal variants.
Year: 2016 PMID: 26920147 PMCID: PMC4769584 DOI: 10.1186/s12864-016-2443-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Composition of three different mixed breed training (reference) sets, and several validation sets chosen to represent different levels of relatedness to the training sets
| Training set: description | Training set: total | Training set: number per breed | Validation sets: in order of decreasing relatedness to the Training set |
|---|---|---|---|
| “DANZ” bulls of Dutch, Aust & N. Zealand origin with real genotypes and real phenotypesᵃ | 8920 | 7371 Holstein | 1. 869 Red Holstein bulls |
| “AUS” Australian bulls & cows with real genotypes and real phenotypesᵃ | 16,214 | 11,527 Holstein: | 1. 869 Red Holstein bulls |
| “AUS-Sim” Subset of above AUS set, with real genotypes and simulated phenotypes | 10,314 | 7991 Holstein | 1. 262 Holstein bulls only |
a Phenotypes were milk, protein and fat yield. For bulls these are daughter averages from progeny tests. All phenotypes were corrected for known fixed effects
Description of BayesRC models used to analyse the SEQᵃ genotype data
| Name of BayesRC Model | Variant Allocation to Classes I, II and III | Number of variants per classᶜ |
|---|---|---|
| BayesRC Seq | I. NSC (non-synonymous coding) | 45,026 |
| BayesRC Lact | I. NSC & in Lactᵇ genes | 4650 |
| BayesRC RLact | I. NSC & in random set of 790 genes | 4350 |
a SEQ = a pruned set of 994,019 genome-wide sequence variants from coding and regulatory regions, as well as SNPs from a high-density genotyping array. Variants were allocated to one of three BayesRC classes as listed
b Lact refers to a set of 790 candidate genes shown in an independent study to be differentially expressed in association with altered milk production
c The numbers analysed were generally slightly lower than those listed because variants with MAF < 0.002 in any given training population were also excluded from the analyses
Fig. 1 (a, b and c) Accuracy of genomic prediction for real genotypes with simulated phenotypes (3 traits with h² = 0.6) with a range of BayesR and BayesRC models (AUS-Sim data). BayesR models used 800 K SNP array genotypes or sequence data (SEQ), while all BayesRC models used SEQ data (models described in Table 2). The results are shown for the three simulated traits: (a) QTL simulated on variants in or close to a set of 790 Lact genes, (b) QTL simulated on NSC or REG variants only and (c) QTL simulated at random genome-wide on NSC, REG and CHIP variants
Average number of QTL estimated per distribution and per class of the BayesRC Lact modelᵃ, compared with the true number of simulated QTL
| Class | Source | $N(0,\,0.0001\sigma_g^2)$ | $N(0,\,0.001\sigma_g^2)$ | $N(0,\,0.01\sigma_g^2)$ | Total per class |
|---|---|---|---|---|---|
| Class I | True number | 436 | 63 | 1 | 500 |
| Class I | BayesRC Lact | 444 | 36 | 4 | 484 |
| Class II | True number | 3049 | 437 | 14 | 3500 |
| Class II | BayesRC Lact | 2512 | 346 | 16 | 2874 |
| Class III | True number | 0 | 0 | 0 | 0 |
| Class III | BayesRC Lact | 219 | 11 | 1 | 231 |
| Total per distribution | True number | 3485 | 500 | 15 | |
| Total per distribution | BayesRC Lact | 3175 | 393 | 21 | |
a Results are for Trait 1 (AUS-Sim data) where QTL were simulated in Lact gene regions only
Fig. 2 The observed proportion of true QTL among variants with posterior probabilities falling in one of five bins (bars) compared to the median posterior probability for variants in each bin (lines). Posterior probabilities are calculated as the proportion of iterations in which a variant was estimated to have a real effect on the trait. Results are from the AUS-Sim data (real cattle genotypes with 4000 simulated QTL) for three simulated traits with BayesR SEQ, BayesRC Seq and BayesRC Lact models (see Table 2 for description of BayesRC models)
Fig. 3 Number of true QTL discovered (log scale) within groups of variants binned on posterior probabilities, for three simulated traits. The sum across all bins is the number of true QTL with posterior probability > 0.01 out of a total of 4000 simulated QTL. Results are shown for the AUS-Sim data (real genotypes with 4000 simulated QTL) applying a range of BayesR and BayesRC models (see Table 2 for description of BayesRC models). Posterior probabilities are calculated as the proportion of iterations in which a variant was estimated to have a real effect on the trait
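The posterior probabilities and the binning described in the captions of Figs. 2 and 3 can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function names, the 0/1 indicator layout and the bin edges are assumptions made for the example (the captions mention five bins but do not give their boundaries).

```python
import numpy as np


def posterior_probability(nonzero_flags):
    """Proportion of MCMC iterations in which each variant had a non-zero effect.

    nonzero_flags: (n_iterations, n_variants) array of 0/1 indicators
    (hypothetical layout chosen for this sketch).
    """
    return np.asarray(nonzero_flags, dtype=float).mean(axis=0)


def summarise_bins(prob, is_true_qtl, edges):
    """Per bin: (lower, upper, n_variants, proportion of true QTL, median probability)."""
    prob = np.asarray(prob, dtype=float)
    truth = np.asarray(is_true_qtl, dtype=bool)
    rows = []
    for k in range(len(edges) - 1):
        lower, upper = edges[k], edges[k + 1]
        is_last = k == len(edges) - 2
        # Bins are half-open [lower, upper); the last bin also includes its upper edge.
        in_bin = (prob >= lower) & ((prob <= upper) if is_last else (prob < upper))
        n = int(in_bin.sum())
        if n == 0:
            rows.append((lower, upper, 0, float("nan"), float("nan")))
        else:
            rows.append((lower, upper, n,
                         float(truth[in_bin].mean()),
                         float(np.median(prob[in_bin]))))
    return rows


# Toy data (assumed): 4 saved iterations x 3 variants; variants 1 and 3 are true QTL.
flags = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]]
prob = posterior_probability(flags)
rows = summarise_bins(prob, [True, False, True], edges=[0.0, 0.5, 1.0])
```

If the posterior probabilities are well calibrated, the proportion of true QTL in each bin should sit close to the median probability of that bin, which is the comparison Fig. 2 draws between bars and lines.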
Accuracyᵃ of the DANZ training predictions for Fat, Milk and Protein Yield in the Red Holstein bull and the Australian Red cow validation sets
| Analytical Modelᵇ | Fat: Red Hol | Fat: Aust Red | Milk: Red Hol | Milk: Aust Red | Protein: Red Hol | Protein: Aust Red |
|---|---|---|---|---|---|---|
| BayesR 800 K | 0.565 (0.001) | 0.344 (0.003) | 0.650 (0.001) | 0.317 (0.003) | 0.603 (0.001) | 0.200 (0.001) |
| BayesR SEQ | 0.572 (0.001) | | 0.663 (0.002) | 0.308 (0.005) | 0.612 (0.001) | 0.220 (0.003) |
| BayesRC Lact | | 0.353 (0.002) | | | | |
| BayesRC RLact | 0.571 (0.001) | 0.352 (0.002) | 0.657 (0.001) | 0.302 (0.005) | 0.612 (0.001) | 0.218 (0.002) |
a Estimated as the average correlation between the genomic prediction and corrected phenotypes. The highest accuracy is in bold font in each column. Numbers in brackets indicate the relative convergence of 5 independent Bayesian MCMC chains (estimated from [SD of the mean accuracy]/√5). Note: the numbers in brackets should not be interpreted as a “standard error” because they are estimated from 5 Bayesian MCMC chains run on the same data set
b BayesR models used either 800 K SNP array (600,640 genotypes) or 994,019 sequence variants (SEQ). The BayesRC model definitions are given in Table 2
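The accuracy and the bracketed convergence figure defined in footnote a can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption because the footnote does not say which SD estimator was used.

```python
import numpy as np


def prediction_accuracy(predicted, corrected_phenotype):
    """Pearson correlation between genomic predictions and corrected phenotypes."""
    return float(np.corrcoef(predicted, corrected_phenotype)[0, 1])


def chain_summary(chain_accuracies):
    """Return (mean accuracy, SD / sqrt(number of chains)) across MCMC chains.

    The second value mirrors the bracketed numbers in the table. It describes
    agreement between chains run on the same data, not a standard error.
    ddof=1 (sample SD) is an assumption of this sketch.
    """
    acc = np.asarray(chain_accuracies, dtype=float)
    mean = float(acc.mean())
    spread = float(acc.std(ddof=1) / np.sqrt(acc.size))
    return mean, spread


# Toy accuracies from 5 hypothetical chains (not values from the paper).
mean_acc, spread = chain_summary([0.56, 0.57, 0.57, 0.57, 0.58])
```

A small bracketed value therefore indicates only that the five chains agree with one another, which is why the footnote warns against reading it as a standard error.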
Accuracyᵃ of the AUS training predictions for Fat, Milk and Protein Yield in the Red Holstein bull and Australian Red cow validation sets
| Analytical Modelᵇ | Fat: Red Hol | Fat: Aust Red | Milk: Red Hol | Milk: Aust Red | Protein: Red Hol | Protein: Aust Red |
|---|---|---|---|---|---|---|
| BayesR 800 K | 0.527 (0.002) | 0.265 (0.001) | 0.580 (0.001) | 0.235 (0.005) | 0.530 (0.002) | 0.155 (0.004) |
| BayesR SEQ | | 0.275 (0.002) | 0.601 (0.004) | 0.258 (0.008) | 0.548 (0.002) | 0.174 (0.005) |
| BayesRC Lact | 0.540 (0.003) | | | | | 0.154 (0.015) |
| BayesRC RLact | 0.541 (0.002) | 0.272 (0.004) | 0.602 | 0.253 (0.012) | 0.551 (0.002) | |
a Estimated as the correlation between the predicted genomic values and corrected phenotypes. The highest accuracy is in bold font in each column. Numbers in brackets indicate the relative convergence of 5 independent Bayesian MCMC chains (estimated from [SD of the mean accuracy]/√5). Note: the numbers in brackets should not be interpreted as a “standard error” because they are estimated from 5 Bayesian MCMC chains run on the same data set
b BayesR models used either 800 K SNP array (600,640 genotypes) or 994,019 sequence variants (SEQ). The BayesRC model definitions are given in Table 2
Average number of variant effects per non-zero distribution (variances $0.0001\sigma_g^2$, $0.001\sigma_g^2$ and $0.01\sigma_g^2$) of BayesR SEQ and BayesRC Lact modelsᵃ
| Trait | Model | $N(0,\,0.0001\sigma_g^2)$: AUS | $N(0,\,0.0001\sigma_g^2)$: DANZ | $N(0,\,0.001\sigma_g^2)$: AUS | $N(0,\,0.001\sigma_g^2)$: DANZ | $N(0,\,0.01\sigma_g^2)$: AUS | $N(0,\,0.01\sigma_g^2)$: DANZ |
|---|---|---|---|---|---|---|---|
| Milk Yield | BayesR SEQ | 4263 | 5239 | 60 | 91 | 7 | 9 |
| Milk Yield | BayesRC Lact | 4276 | 5294 | 56 | 89 | 9 | 11 |
| Fat Yield | BayesR SEQ | 4769 | 5969 | 14 | 28 | 5 | 8 |
| Fat Yield | BayesRC Lact | 4774 | 5841 | 24 | 43 | 7 | 10 |
| Protein Yield | BayesR SEQ | 4604 | 6292 | 40 | 38 | 5 | 6 |
| Protein Yield | BayesRC Lact | 4641 | 6292 | 39 | 41 | 7 | 8 |
a Results are for Milk, Fat and Protein Yield in both the DANZ and AUS training sets
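The distributions named in the table can be read as components of a mixture prior on each variant effect. The block below is a sketch of that prior, assuming the usual BayesR formulation with a fourth, zero-variance component for variants with no effect. The symbols $\beta_i$ (effect of variant $i$), $c(i)$ (the class of variant $i$) and $\pi_{c,k}$ (mixing proportions within class $c$) are notation chosen here and do not appear in the source.

```latex
\beta_i \mid \boldsymbol{\pi}_{c(i)}, \sigma_g^2 \;\sim\;
    \pi_{c(i),1}\, N(0,\; 0 \cdot \sigma_g^2)
  + \pi_{c(i),2}\, N(0,\; 0.0001\,\sigma_g^2)
  + \pi_{c(i),3}\, N(0,\; 0.001\,\sigma_g^2)
  + \pi_{c(i),4}\, N(0,\; 0.01\,\sigma_g^2),
\qquad \sum_{k=1}^{4} \pi_{c,k} = 1 .
```

Under this reading, BayesR corresponds to a single class containing every variant, whereas BayesRC allows the mixing proportions to differ between Classes I, II and III.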
Proportion of non-zero variant effects estimated per distribution, within each class of the BayesRC Lact model for Milk Yield
| Model | Class | Number of Variants | $N(0,\,0.0001\sigma_g^2)$: AUS | $N(0,\,0.0001\sigma_g^2)$: DANZ | $N(0,\,0.001\sigma_g^2)$: AUS | $N(0,\,0.001\sigma_g^2)$: DANZ | $N(0,\,0.01\sigma_g^2)$: AUS | $N(0,\,0.01\sigma_g^2)$: DANZ |
|---|---|---|---|---|---|---|---|---|
| BayesR SEQ | N/A | | | | | | | |
| BayesRC Lact | Class I | 3709 | 3.91 % | 3.76 % | 0.38 % | 0.24 % | 0.07 % | 0.045 % |
| BayesRC Lact | Class II | 57,541 | 1.01 % | 0.65 % | 0.03 % | 0.04 % | 0.004 % | 0.006 % |
| BayesRC Lact | Class III | 847,892 | 0.43 % | 0.57 % | 0.01 % | 0.007 % | 0.0003 % | 0.0007 % |
Results are given for both AUS and DANZ training sets, and are compared to the distribution of variant effects in the BayesR SEQ model (bold figures)
Fig. 4 Accuracy of prediction (real DANZ data) per variant class of the BayesRC Lact model compared with BayesR predictions using a matching number of randomly selected variants (BayesR_Random). Accuracy was estimated as the correlation between the predicted value and the Red Holstein phenotypes (for Fat, Milk and Protein Yield). The boxplot shows the median and range of values for all replicates (grey dots representing outliers)
Fig. 5 QTL discovery with GWAS (-log10 of p-value) and BayesRC Lact (posterior probability) for Milk and Protein Yield around the casein gene cluster (yellow highlight) and GC gene. The BayesRC variant with the top probability (real AUS data) is shown by a purple diamond in each plot (labelled with chromosome and bp position). The strength of LD (r²) between this top variant and all others is colour coded
Fig. 6 QTL discovery with GWAS (-log p-value) and BayesRC Lact (posterior probability) for Milk and Protein Yield across a 1 Mb region of Chromosome 5. The BayesRC variant with the top posterior probability in a given region (real AUS data) is shown by a purple diamond (labelled with chromosome and bp position). The LD (r²) between this variant and all others is colour coded
Fig. 7 (a and b) QTL discovery: posterior probabilities of variants in the PAEP gene region for BayesRC Lact (a) and BayesR SEQ analysis (b). The BayesRC Lact variant with the top posterior probability (real DANZ data) is shown by a purple diamond in each plot (labelled with chromosome and bp position) and the LD (r²) between this variant and all others is colour coded. The position of the SEQ variants fitted in the model is also shown above
Candidate genes identified by listed variants in coding or regulatory regions with a posterior probability ≥ 0.25 for Milk, Protein or Fat Yield (AUS BayesRC Lact)
| Gene_ID (full names in the table of candidate gene names) | DEᵃ | Effect directions (in column order Milk Y, Prot.Y, Fat Y, P%, F%; columns without an effect omitted) | Evidenceᵇ | Variant type (distance from gene or SIFT prediction) | Variant position (chrom : bp) |
|---|---|---|---|---|---|
| ROBO1 | | +, + | P | upstream (1823 bp) | 1:26212317 |
| SLC37A1 | | + | L,D | downstream (4005 bp) | 1:144441230 |
| PSMB2 | | -, - | P,L | missense (SIFT:deleterious) | 3:110752811 |
| OGDH | | +, + | P | downstream (4105 bp) | 4:77454411 |
| MYH9 | | +, + | P,L | upstream (1635 bp) | 5:75181544 |
| NCF4 | | +, -, - | P,L,V | missense (SIFT:tolerated) | 5:75659419 |
| ARNTL2 | | -, - | P | upstream (3413 bp) | 5:82942569 |
| MGST1 | | +, -, - | P,V,D | upstream (4589 bp) | 5:93954751 |
| CSN2 | | + | L,V,D | intron | 6:87180731 |
| CSN3 | | -, - | P,L,D | missense (SIFT:tolerated) | 6:87390576 |
| GC | | +, +, -, - | P,L,V | upstream (2582 bp) | 6:88741762 |
| RDH8 | | - | L | missense (SIFT:deleterious) | 7:15815974 |
| TTC7B | | + | D | downstream (3086 bp) | 10:103182221 |
| PROM2 | | - | D | missense (SIFT:tolerated) | 11:2003275 |
| PAEP | | +, +, - | P,L,V,D | missense (SIFT:tolerated) | 11:103303475 |
| ABO | | + | L,D | downstream (2688 bp) | 11:104229609 |
| DGAT1 | | +, +, -, -, - | P,L,V | intron | 14:1801116 |
| COX6C | | +, + | P,L | downstream (1091 bp) | 14:66648812 |
| TRIM29 | | -, + | P,D | downstream (658 bp) | 15:31212485 |
| KRT19 | | -, - | P,L,D | missense (SIFT:tolerated) | 19:42366926 |
| PTRF | | -, - | P,D | upstream (4742 bp) | 19:43166907 |
| ERGIC1 | | - | L,D | intron | 20:4543452 |
| GHR | | + | D,V | downstream (4947 bp) | 20:31885789 |
| SMEK1 | | +, +, - | P,V | downstream (2777 bp) | 21:56798101 |
| WARS | | -, - | P,L,D | intron | 21:66916247 |
| MLH1 | | - | L,V | synonymous | 22:10493668 |
| GMDS | | + | D | intron | 23:51280200 |
| MARF1 | | +, + | P | downstream (24 bp) | 25:14138518 |
| SCD | | + | D | downstream (1134 bp) | 26:21140458 |
| PRDX3 | | -, - | P,L | upstream (3744 bp) | 26:39685136 |
The relative direction of the variant effect on milk traits is shown as ‘+’ or ‘-’. The directions of effects for fat and protein percent (F%, P%) are included if their posterior probability was > 0.2 (AUS BayesRC Lact) as further validation of the Yield traits
a The strength of RNAseq differential gene expression in lactating mammary tissue compared to 17 other body tissues [22]. Differential expression is indicated if log2 fold change (LFC) > 1 (i.e. more than a 2¹-fold increase in expression) and p-value < 1.0e-4; “n” indicates no differential expression. The strength of expression is indicated as + for an LFC value from 1 to 2, ++ for 2 to 5, +++ for 5 to 10 and ++++ for above 10
b Evidence for candidate genes included one or more of the following: a member of the Lact gene set (L), associated with more than one milk trait (P), differentially expressed in mammary tissue (D), and/or validated in the DANZ analysis (V)
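The differential expression coding in footnote a amounts to a threshold rule, sketched below in Python. The function name is hypothetical, and the handling of values that fall exactly on a boundary (2, 5 or 10) is an assumption, since the footnote gives overlapping ranges.

```python
def de_strength(lfc, p_value):
    """Map a log2 fold change and p-value to the DE code used in the table.

    Returns "n" unless LFC > 1 and p-value < 1.0e-4, as stated in footnote a.
    Upper bounds are treated as inclusive (assumption of this sketch).
    """
    if not (lfc > 1 and p_value < 1.0e-4):
        return "n"
    if lfc <= 2:
        return "+"
    if lfc <= 5:
        return "++"
    if lfc <= 10:
        return "+++"
    return "++++"


# Illustrative inputs only (not values from the paper).
codes = [de_strength(1.5, 1e-6), de_strength(3.0, 1e-6),
         de_strength(7.0, 1e-5), de_strength(12.0, 1e-10),
         de_strength(3.0, 0.01), de_strength(0.5, 1e-9)]
```

Because LFC is on a log2 scale, an LFC of 1 corresponds to a doubling of expression and an LFC of 10 to roughly a thousand-fold increase.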
Full names of the candidate genes listed in Table 8
| Official Gene Symbol | Gene Name |
|---|---|
| ROBO1 | roundabout, axon guidance receptor, homolog 1 (Drosophila) |
| SLC37A1 | similar to solute carrier family 37 member 1 |
| PSMB2 | proteasome (prosome, macropain) subunit, beta type, 2 |
| OGDH | oxoglutarate (alpha-ketoglutarate) dehydrogenase (lipoamide) |
| MYH9 | myosin, heavy chain 9, non-muscle |
| NCF4 | neutrophil cytosolic factor 4 |
| ARNTL2 | aryl hydrocarbon receptor nuclear translocator-like 2 |
| MGST1 | microsomal glutathione S-transferase 1 |
| CSN2 | beta casein |
| CSN3 | kappa casein |
| GC | group-specific component (vitamin D binding protein) |
| RDH8 | retinol dehydrogenase 8 (all-trans) |
| TTC7B | tetratricopeptide repeat domain 7B |
| PROM2 | prominin 2 |
| PAEP | beta lactoglobulin |
| ABO | ABO blood group (transferase A, alpha 1-3-N-acetylgalactosaminyltransferase; transferase B, alpha 1-3-galactosyltransferase) |
| DGAT1 | diacylglycerol O-acyltransferase homolog 1 |
| COX6C | cytochrome c oxidase subunit VIc |
| TRIM29 | tripartite motif containing 29 |
| KRT19 | keratin 19 |
| PTRF | polymerase I and transcript release factor |
| ERGIC1 | endoplasmic reticulum-golgi intermediate compartment 1 |
| GHR | growth hormone receptor |
| SMEK1 | SMEK homolog 1, suppressor of mek1 |
| WARS | tryptophanyl-tRNA synthetase |
| MLH1 | mutL homolog 1, colon cancer, nonpolyposis type 2 |
| GMDS | GDP-mannose 4,6-dehydratase |
| MARF1 | Meiosis arrest female 1 (alias: KIAA0430) |
| SCD | stearoyl-CoA desaturase (delta-9-desaturase) |
| PRDX3 | peroxiredoxin 3 |