
Efficient strategies for leave-one-out cross validation for genomic best linear unbiased prediction.

Hao Cheng, Dorian J Garrick, Rohan L Fernando.

Abstract

BACKGROUND: A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction, using whole-genome data. Leave-one-out cross validation can be used to quantify the predictive ability of a statistical model.
METHODS: Naive application of leave-one-out cross validation is computationally intensive because the training and validation analyses need to be repeated n times, once for each observation. Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.
RESULTS: The efficient leave-one-out cross validation strategies are 786 times faster than the naive application for a simulated dataset with 1,000 observations and 10,000 markers, and 99 times faster with 1,000 observations and 100 markers. These efficiencies relative to the naive approach using the same model will increase with increases in the number of observations.
CONCLUSIONS: Efficient leave-one-out cross validation strategies are presented here, requiring little more effort than a single analysis.


Keywords: GBLUP; Leave-one-out cross validation

Year: 2017 PMID: 28469846 PMCID: PMC5414316 DOI: 10.1186/s40104-017-0164-6

Source DB: PubMed Journal: J Anim Sci Biotechnol ISSN: 1674-9782

Background

A random multiple-regression model that simultaneously fits all allele substitution effects for additive markers or haplotypes as uncorrelated random effects was proposed for Best Linear Unbiased Prediction (BLUP) [1], using whole-genome data. Breeding values are defined as the sum of the effects of all the markers or haplotypes, and their estimates are widely used for prediction of the merit of selection candidates. Estimates of marker or haplotype effects are used to predict breeding values of individuals that were not present in a previous analysis, commonly referred to as training. An alternative, earlier published approach to using marker or haplotype information fits breeding values as random effects based on covariances defined by a “genomic relationship matrix” computed from genotypes [2]. These two models have been shown to be equivalent in terms of predicting breeding values [3, 4] and we refer to them here as marker effect models (MEM) and breeding value models (BVM), respectively, the latter often known as Genomic Best Linear Unbiased Prediction (GBLUP).

Cross validation is often used to quantify the predictive ability of a statistical model. In k-fold cross validation, the whole dataset is partitioned into k parts and k analyses are performed; in each analysis one part is omitted from training and used for validation. Leave-one-out cross validation (LOOCV) is a special case of k-fold cross validation with k=n, the number of observations. When the dataset is small, LOOCV is appealing because the size of the training set is maximized. However, naive application of LOOCV is computationally intensive, requiring n analyses. We show below how LOOCV can be performed using either the MEM or BVM with little more effort than is required for a single analysis with n observations.
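The partitioning used in k-fold cross validation can be sketched in a few lines of Python. This is only an illustration: the function name `k_fold_splits` and the round-robin assignment of records to folds are assumptions made here, not part of the paper, and in practice records are usually assigned to folds at random.

```python
def k_fold_splits(n, k):
    """Yield (training, validation) index lists for k-fold cross validation."""
    # Hypothetical round-robin assignment: record i goes to fold i mod k.
    folds = [list(range(i, n, k)) for i in range(k)]
    for validation in folds:
        held_out = set(validation)
        training = [i for i in range(n) if i not in held_out]
        yield training, validation


# LOOCV is the special case k = n: every validation set is a single record.
loocv_splits = list(k_fold_splits(5, 5))
two_fold_splits = list(k_fold_splits(5, 2))
```

With k equal to n, the training and validation analyses are repeated n times, which is the cost that the strategies in this paper avoid.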

Methods

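Use of the MEM is more efficient when the number n of individuals is larger than the number p of markers, because for this model the mixed model equations are of order p plus the number of other effects. When n < p, use of the BVM is more efficient, because for this model the mixed model equations are of order n plus the number of other effects.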

Marker effect models

The MEM for GBLUP can be written as

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \tag{1}$$

where $\mathbf{y}$, an n×1 vector of phenotypes, has been pre-corrected for all fixed effects other than μ, the overall mean, $\mathbf{X}$ is the n×p matrix of marker covariates, $\boldsymbol{\beta}$ is a p×1 random vector of the allele substitution effects and $\mathbf{e}$ is an n×1 random vector of residuals. Often it is assumed that marker effects are identically and independently distributed (iid) random variables with null means and variances $\sigma^2_{\beta}$. Thus, under the usual assumption that the residuals are iid with null means and variances $\sigma^2_{e}$, E(y)=1μ. When MEM is used, LOOCV can be performed by using a well-known strategy used in least-squares regression to compute the predicted residual sum of squares (PRESS) [5] statistic.

LOOCV strategy for MEM

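BLUP of $\boldsymbol{\beta}^{*}$ can be obtained by solving the mixed model equations

$$(\mathbf{X}^{*\prime}\mathbf{X}^{*} + \mathbf{D}\lambda)\,\hat{\boldsymbol{\beta}}^{*} = \mathbf{X}^{*\prime}\mathbf{y}, \tag{2}$$

where $\mathbf{X}^{*} = [\mathbf{1}\ \ \mathbf{X}]$, $\boldsymbol{\beta}^{*} = [\mu\ \ \boldsymbol{\beta}']'$, $\mathbf{D}$ is a diagonal matrix whose elements are 0 followed by a p vector of 1s and $\lambda = \sigma^2_{e}/\sigma^2_{\beta}$. Now, BLUP for $\boldsymbol{\beta}^{*}$, where observation j is left out, can be obtained as

$$\hat{\boldsymbol{\beta}}^{*}_{-j} = (\mathbf{X}^{*\prime}_{-j}\mathbf{X}^{*}_{-j} + \mathbf{D}\lambda)^{-1}\mathbf{X}^{*\prime}_{-j}\mathbf{y}_{-j}, \tag{3}$$

where $\mathbf{X}^{*}_{-j}$ is $\mathbf{X}^{*}$ with the jth row removed and $\mathbf{y}_{-j}$ is $\mathbf{y}$ with the jth element removed. Suppose $\mathbf{x}_j'$ is the jth row of $\mathbf{X}^{*}$, then from the matrix inverse lemma [4],

$$(\mathbf{X}^{*\prime}_{-j}\mathbf{X}^{*}_{-j} + \mathbf{D}\lambda)^{-1} = (\mathbf{X}^{*\prime}\mathbf{X}^{*} - \mathbf{x}_j\mathbf{x}_j' + \mathbf{D}\lambda)^{-1} = (\mathbf{X}^{*\prime}\mathbf{X}^{*} + \mathbf{D}\lambda)^{-1} + \frac{(\mathbf{X}^{*\prime}\mathbf{X}^{*} + \mathbf{D}\lambda)^{-1}\mathbf{x}_j\mathbf{x}_j'(\mathbf{X}^{*\prime}\mathbf{X}^{*} + \mathbf{D}\lambda)^{-1}}{1 - H_{jj}}, \tag{4}$$

where the quadratic $H_{jj} = \mathbf{x}_j'(\mathbf{X}^{*\prime}\mathbf{X}^{*} + \mathbf{D}\lambda)^{-1}\mathbf{x}_j$ is the jth diagonal element of $\mathbf{H} = \mathbf{X}^{*}(\mathbf{X}^{*\prime}\mathbf{X}^{*} + \mathbf{D}\lambda)^{-1}\mathbf{X}^{*\prime}$. Combining (3) and (4), and noting that $\mathbf{X}^{*\prime}_{-j}\mathbf{y}_{-j} = \mathbf{X}^{*\prime}\mathbf{y} - \mathbf{x}_j y_j$, the prediction residual for the jth observation can be written as

$$\hat{e}_j = y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}^{*}_{-j} = \frac{y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}^{*}}{1 - H_{jj}}. \tag{6}$$

These prediction errors can be squared and accumulated over n realizations to compute PRESS, defined as $\sum_{j=1}^{n}\hat{e}_j^{2}$. The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of $y_j$, and that correlation can be estimated from the leave-one-out predictions $\hat{y}_j = \mathbf{x}_j'\hat{\boldsymbol{\beta}}^{*}_{-j}$, which can be computed efficiently as $y_j - \hat{e}_j$, using the observed values of $y_j$. When a specific group of individuals is of interest, prediction accuracies and PRESS can also be calculated using $\hat{e}_j$ for individuals in that group.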
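A minimal NumPy sketch of this strategy, applied to the data of the numerical example (Table 1), may help to make the shortcut concrete. It is an illustration under stated assumptions rather than the authors' Julia implementation: the function names are hypothetical, and the value λ = 10 is an assumption inferred from the entries of Table 3. The naive refit is included only to check the shortcut against it.

```python
import numpy as np

# Data of Table 1: 3 individuals (rows), 5 markers (columns).
X = np.array([[1, 2, 1, 2, 2],
              [2, 1, 0, 1, 1],
              [0, 0, 2, 1, 2]], dtype=float)
y = np.array([1.97, 2.12, -0.62])
lam = 10.0  # assumed sigma_e^2 / sigma_beta^2, inferred from Table 3


def mem_lhs(Xs, lam):
    """Coefficient matrix X*'X* + D*lambda; D has a 0 for the mean, then 1s."""
    D = np.eye(Xs.shape[1])
    D[0, 0] = 0.0
    return Xs.T @ Xs + lam * D


def loocv_mem_efficient(X, y, lam):
    """All n prediction residuals from one analysis of the full data."""
    Xs = np.column_stack([np.ones(len(y)), X])       # X* = [1 X]
    A_inv = np.linalg.inv(mem_lhs(Xs, lam))
    fitted = Xs @ (A_inv @ (Xs.T @ y))               # x_j' beta_hat*
    h = np.einsum("ij,jk,ik->i", Xs, A_inv, Xs)      # diagonal of H
    return (y - fitted) / (1.0 - h), h


def loocv_mem_naive(X, y, lam):
    """Reference version: refit n times, each time without record j."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X])
    e = np.empty(n)
    for j in range(n):
        keep = np.arange(n) != j
        b = np.linalg.solve(mem_lhs(Xs[keep], lam), Xs[keep].T @ y[keep])
        e[j] = y[j] - Xs[j] @ b
    return e


e_hat, h = loocv_mem_efficient(X, y, lam)
e_naive = loocv_mem_naive(X, y, lam)
press = float(np.sum(e_hat ** 2))                    # PRESS
y_pred = y - e_hat                                   # leave-one-out predictions
accuracy = float(np.corrcoef(y, y_pred)[0, 1])       # corr(observed, predicted)
```

Under the assumed λ, the diagonal elements `h` and the residuals `e_hat` agree with the values reported in Tables 2 and 4 up to rounding, and `e_hat` matches the naive refit to numerical precision.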

Breeding value models

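When n < p, it is more efficient to use the BVM, which can be written as

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{u} + \mathbf{e}, \tag{7}$$

where $\mathbf{u} = \mathbf{X}\boldsymbol{\beta}$, $\mathrm{Var}(\mathbf{u}) = \mathbf{X}\mathbf{X}'\sigma^2_{\beta}$, $\mathbf{Z}$ is the identity matrix of order n and other variables are as in the MEM. Further, in both models E(y)=1μ and $\mathrm{Var}(\mathbf{y}) = \mathbf{X}\mathbf{X}'\sigma^2_{\beta} + \mathbf{I}\sigma^2_{e}$. These two models are said to be equivalent [6], and linear functions predicted from one model are identical to corresponding predictions from the other model. Two efficient strategies for LOOCV using the BVM are shown below.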
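The equality of the first and second moments under the two models can be checked in two lines. The derivation below uses only the definitions given for the BVM (Z = I, u = Xβ) together with the usual assumption, made here explicitly, that marker effects and residuals are uncorrelated.

```latex
\begin{aligned}
\mathrm{E}(\mathbf{y}) &= \mathbf{1}\mu + \mathbf{Z}\,\mathrm{E}(\mathbf{X}\boldsymbol{\beta}) + \mathrm{E}(\mathbf{e}) = \mathbf{1}\mu, \\
\mathrm{Var}(\mathbf{y}) &= \mathbf{Z}\,\mathrm{Var}(\mathbf{X}\boldsymbol{\beta})\,\mathbf{Z}' + \mathrm{Var}(\mathbf{e})
  = \mathbf{X}\,(\mathbf{I}\sigma^2_{\beta})\,\mathbf{X}' + \mathbf{I}\sigma^2_{e}
  = \mathbf{X}\mathbf{X}'\sigma^2_{\beta} + \mathbf{I}\sigma^2_{e}.
\end{aligned}
```

These are the same mean and variance as obtained directly from the MEM, which is the sense in which the two models are equivalent.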

LOOCV strategy I for BVM

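The mixed model equations for this model are:

$$\begin{bmatrix} \mathbf{1}'\mathbf{1} & \mathbf{1}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{1} & \mathbf{Z}'\mathbf{Z} + \mathbf{G}^{-1}\lambda \end{bmatrix} \begin{bmatrix} \hat{\mu} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{1}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix}, \tag{8}$$

where $\mathbf{G} = \mathbf{X}\mathbf{X}'$ and $\lambda = \sigma^2_{e}/\sigma^2_{\beta}$. Due to the relative order of the coefficient matrices for the MEM and the BVM, when n < p the equations (8), of order n+1, are smaller than the equations (2), of order p+1, and LOOCV is performed more efficiently with the BVM. By the same argument as for the MEM, the prediction residual for the jth observation can be written as

$$\hat{e}_j = \frac{y_j - \hat{\mu} - \hat{u}_j}{1 - C_{jj}}, \tag{9}$$

where the quadratic $C_{jj}$ is the jth diagonal element of $\mathbf{C} = \mathbf{Z}^{*}(\mathbf{Z}^{*\prime}\mathbf{Z}^{*} + \mathbf{K}\lambda)^{-1}\mathbf{Z}^{*\prime}$, with $\mathbf{Z}^{*} = [\mathbf{1}\ \ \mathbf{Z}]$ and $\mathbf{K}$ the block-diagonal matrix whose blocks are 0 and $\mathbf{G}^{-1}$.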
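Strategy I for the BVM can be sketched in the same way with NumPy, again on the data of Table 1. The function name is hypothetical, λ = 10 is an assumption inferred from Table 3, and the sketch assumes that XX' is nonsingular, as it is for these data.

```python
import numpy as np

# Data of Table 1: 3 individuals (rows), 5 markers (columns).
X = np.array([[1, 2, 1, 2, 2],
              [2, 1, 0, 1, 1],
              [0, 0, 2, 1, 2]], dtype=float)
y = np.array([1.97, 2.12, -0.62])
lam = 10.0  # assumed sigma_e^2 / sigma_beta^2, inferred from Table 3


def loocv_bvm_strategy1(X, y, lam):
    """Prediction residuals e_j = (y_j - mu_hat - u_hat_j) / (1 - C_jj)."""
    n = len(y)
    G = X @ X.T                                       # assumed nonsingular
    Zs = np.column_stack([np.ones(n), np.eye(n)])     # Z* = [1 Z] with Z = I
    K = np.zeros((n + 1, n + 1))
    K[1:, 1:] = np.linalg.inv(G)                      # no penalty on the mean
    # Hat matrix of the BVM: C = Z* (Z*'Z* + K lambda)^-1 Z*'
    C = Zs @ np.linalg.solve(Zs.T @ Zs + lam * K, Zs.T)
    fitted = C @ y                                    # mu_hat + u_hat_j
    c = np.diag(C)
    return (y - fitted) / (1.0 - c), c


e_hat, c = loocv_bvm_strategy1(X, y, lam)
press = float(np.sum(e_hat ** 2))
```

Because the coefficient matrix here is of order n+1 rather than p+1, this form is the cheaper one when markers outnumber records; for the three records of Table 1 it returns the same residuals as the MEM strategy.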

LOOCV strategy II for BVM

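Another efficient strategy for BVM is shown here. First we consider the situation where $\mathbf{y}$ has been pre-corrected for μ in addition to nuisance effects so that E(y)=0 and we define $\mathbf{V} = \mathrm{Var}(\mathbf{y}) = \mathbf{X}\mathbf{X}'\sigma^2_{\beta} + \mathbf{I}\sigma^2_{e}$. Now matrix $\mathbf{Q}$ is constructed by augmenting the covariance matrix of $\mathbf{y}$ with one leading row and column as

$$\mathbf{Q} = \begin{bmatrix} \mathbf{y}'\mathbf{y} & \mathbf{y}' \\ \mathbf{y} & \mathbf{V} \end{bmatrix}.$$

To obtain the prediction error for observation j, the second row and column of $\mathbf{Q}$ are permuted with row and column j+1. In this manner $\mathbf{Q}$ has its rows and columns symmetrically permuted as $\mathbf{P}_j'\mathbf{Q}\mathbf{P}_j$, where the permutation matrix $\mathbf{P}_j$ is obtained by permuting the second row of the identity matrix of order n+1 with row j+1. So the permuted matrix is:

$$\mathbf{W} = \mathbf{P}_j'\mathbf{Q}\mathbf{P}_j = \begin{bmatrix} \mathbf{y}'\mathbf{y} & y_j & \mathbf{y}_{-j}' \\ y_j & v_{jj} & \mathbf{v}_{-j,j}' \\ \mathbf{y}_{-j} & \mathbf{v}_{-j,j} & \mathbf{V}_{-j} \end{bmatrix} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}' & \mathbf{C} \end{bmatrix},$$

where we will define the leading 2×2 matrix as $\mathbf{A} = \begin{bmatrix} \mathbf{y}'\mathbf{y} & y_j \\ y_j & v_{jj} \end{bmatrix}$, and the other partitions as $\mathbf{B} = \begin{bmatrix} \mathbf{y}_{-j}' \\ \mathbf{v}_{-j,j}' \end{bmatrix}$ and $\mathbf{C} = \mathbf{V}_{-j}$, where −j denotes that the jth element, row or column has been removed. Defining $\mathbf{W}^{11}$ as the top left or leading 2×2 sub-matrix in $\mathbf{W}^{-1}$ corresponding to the position of $\mathbf{A}$ in $\mathbf{W}$, and using partitioned inverse-matrix identities [7], the inverse of $\mathbf{W}^{11}$ can be written as

$$(\mathbf{W}^{11})^{-1} = \mathbf{A} - \mathbf{B}\mathbf{C}^{-1}\mathbf{B}' = \begin{bmatrix} \mathbf{y}'\mathbf{y} - \mathbf{y}_{-j}'\mathbf{V}_{-j}^{-1}\mathbf{y}_{-j} & y_j - \mathbf{y}_{-j}'\mathbf{V}_{-j}^{-1}\mathbf{v}_{-j,j} \\ y_j - \mathbf{v}_{-j,j}'\mathbf{V}_{-j}^{-1}\mathbf{y}_{-j} & v_{jj} - \mathbf{v}_{-j,j}'\mathbf{V}_{-j}^{-1}\mathbf{v}_{-j,j} \end{bmatrix}. \tag{10}$$

Now in element (2,1) of the above inverse matrix, $\mathbf{v}_{-j,j}$ is the vector of covariances between $y_j$ and $\mathbf{y}_{-j}$ and $\mathbf{V}_{-j}^{-1}$ is the inverse of the covariance matrix of $\mathbf{y}_{-j}$. Thus, $\hat{y}_j = \mathbf{v}_{-j,j}'\mathbf{V}_{-j}^{-1}\mathbf{y}_{-j}$ is the Best Linear Predictor (BLP) of $y_j$ given $\mathbf{y}_{-j}$, and element (2,1) of (10) is the prediction error of $y_j$. The element (2,2) in (10) is the prediction error variance (PEV) for $y_j$, that is, $\mathrm{Var}(y_j - \hat{y}_j)$.

PEV can also be used to calculate the theoretical reliability for each individual, and characterizing the distribution of reliability for all the individuals in a dataset has a number of practical applications. Note this allows us to obtain the PEV of every individual, and the distribution of these values provides information as to the robustness of genomic predictions across the population of individuals represented in the dataset. This PEV is determined by the genomic variance-covariance matrix and does not depend on $\mathbf{y}$. Two different datasets could generate the same PRESS statistic but with different distributions of PEV.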
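Now, because the permutation matrix $\mathbf{P}_j$ is orthogonal, $\mathbf{W}^{-1} = (\mathbf{P}_j'\mathbf{Q}\mathbf{P}_j)^{-1} = \mathbf{P}_j'\mathbf{Q}^{-1}\mathbf{P}_j$, and the elements of $\mathbf{W}^{11}$ that are of interest in terms of predicting individual j can be obtained directly from $\mathbf{Q}^{-1}$ as

$$\mathbf{W}^{11} = \begin{bmatrix} q^{1,1} & q^{1,j+1} \\ q^{j+1,1} & q^{j+1,j+1} \end{bmatrix}. \tag{11}$$

It follows that $\hat{e}_j$, which is the off-diagonal element of the inverse of the 2×2 matrix $\mathbf{W}^{11}$, can be written in terms of $\mathbf{Q}^{-1}$ as

$$\hat{e}_j = \frac{-q^{j+1,1}}{q^{1,1}\,q^{j+1,j+1} - q^{1,j+1}\,q^{j+1,1}}, \tag{12}$$

where $q^{r,c}$ is the element from row r and column c of $\mathbf{Q}^{-1}$. Thus, once $\mathbf{Q}^{-1}$ is computed, $\hat{e}_j$ for all j can be computed using (12), and these values can be used to compute PRESS as $\sum_{j=1}^{n}\hat{e}_j^{2}$. To estimate the correlation between the predicted and observed values of $y_j$, the value of $\hat{y}_j$ is efficiently obtained as the difference $y_j - \hat{e}_j$.

Now we consider the situation without pre-correcting $\mathbf{y}$ for μ, where E(y)=1μ. Now the mixed model (7) contains both fixed and random effects. Note that the mixed model equations that correspond to this mixed effects model can be derived by treating μ as “random” with null mean and large variance. So, let μ have null mean and variance $\sigma^2_{\mu}$ for a sufficiently large value of $\sigma^2_{\mu}$. Then under this assumption, E(y)=0 and $\mathrm{Var}(\mathbf{y}) = \mathbf{V} = \mathbf{1}\mathbf{1}'\sigma^2_{\mu} + \mathbf{X}\mathbf{X}'\sigma^2_{\beta} + \mathbf{I}\sigma^2_{e}$, and thus $\hat{y}_j = \mathbf{v}_{-j,j}'\mathbf{V}_{-j}^{-1}\mathbf{y}_{-j}$ is the BLP, from the random effects rather than mixed effects model, of $y_j$ given $\mathbf{y}_{-j}$. This BLP obtained from the model with random μ will be numerically very close to the BLUP obtained from the mixed model with fixed μ. The $\mathbf{Q}$ matrix corresponding to the BLP with random μ is constructed as

$$\mathbf{Q} = \begin{bmatrix} \mathbf{y}'\mathbf{y} & \mathbf{y}' \\ \mathbf{y} & \mathbf{1}\mathbf{1}'\sigma^2_{\mu} + \mathbf{X}\mathbf{X}'\sigma^2_{\beta} + \mathbf{I}\sigma^2_{e} \end{bmatrix}$$

and prediction residuals are obtained as in (12).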
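Strategy II can be sketched with NumPy on the data of Table 1, treating μ as random with a large variance. The function name is hypothetical, and the variance components used (σ²_β = 0.1, σ²_e = 1, σ²_μ = 1,000) are assumptions inferred from the entries of Table 3. No explicit permutation is needed, because the four elements of W^11 can be read directly from Q^-1.

```python
import numpy as np

# Data of Table 1: 3 individuals (rows), 5 markers (columns).
X = np.array([[1, 2, 1, 2, 2],
              [2, 1, 0, 1, 1],
              [0, 0, 2, 1, 2]], dtype=float)
y = np.array([1.97, 2.12, -0.62])
# Assumed variance components, inferred from the entries of Table 3.
sigma2_beta, sigma2_e, sigma2_mu = 0.1, 1.0, 1000.0


def loocv_bvm_strategy2(X, y, sigma2_beta, sigma2_e, sigma2_mu):
    """Prediction residuals and PEV for every record from one inverse of Q."""
    n = len(y)
    V = (sigma2_mu * np.ones((n, n))          # random mean with large variance
         + sigma2_beta * (X @ X.T)
         + sigma2_e * np.eye(n))
    Q = np.empty((n + 1, n + 1))
    Q[0, 0] = y @ y
    Q[0, 1:] = y
    Q[1:, 0] = y
    Q[1:, 1:] = V
    Qi = np.linalg.inv(Q)
    q11 = Qi[0, 0]                            # q^{1,1}
    qj1 = Qi[1:, 0]                           # q^{j+1,1} for j = 1..n
    qjj = np.diag(Qi)[1:]                     # q^{j+1,j+1}
    det = q11 * qjj - qj1 ** 2                # determinant of each 2x2 W^11
    e_hat = -qj1 / det                        # element (2,1) of (W^11)^-1
    pev = q11 / det                           # element (2,2) of (W^11)^-1
    return e_hat, pev, Q


e_hat, pev, Q = loocv_bvm_strategy2(X, y, sigma2_beta, sigma2_e, sigma2_mu)
press = float(np.sum(e_hat ** 2))
y_pred = y - e_hat                            # leave-one-out BLP of each y_j
```

The PEV values returned here depend only on V, not on y, so they can be inspected separately from PRESS when the robustness of predictions across individuals is of interest.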

Numerical example

Phenotypes and genotypes at 5 markers for 3 individuals are in Table 1. Assume $\lambda = \sigma^2_{e}/\sigma^2_{\beta} = 10$, the value consistent with the entries of Tables 2 and 3, and that the overall mean μ is the only fixed effect. In the LOOCV strategy for MEM and strategy I for BVM, the diagonal elements of $\mathbf{H}$ for MEM and $\mathbf{C}$ for BVM, which appear in the denominators of (6) and (9), are in Table 2. The numerators of (6) and (9) are obtained by solving the MME (2) and (8). Then prediction errors are calculated as in (6) and (9) and shown in Table 4. In LOOCV strategy II for BVM, the matrix $\mathbf{Q}$ (Table 3) is constructed using $\sigma^2_{\mu} = 1{,}000$ with $\sigma^2_{e} = 1$ (the values consistent with the entries of Table 3), which is sufficiently large relative to the other variance components for μ to be indistinguishable from a fixed effect with a flat prior. The prediction errors are calculated as in (12) and shown in Table 4. The MEM strategy and BVM strategy I gave identical prediction errors and identical PRESS for this numerical example, and these were numerically very close to those from BVM strategy II.

Table 1

Phenotypes and genotypes at 5 markers for 3 individuals used in the numerical example

	M1	M2	M3	M4	M5	Phenotypes
1	1	2	1	2	2	1.97
2	2	1	0	1	1	2.12
3	0	0	2	1	2	–0.62

Table 2

Diagonal elements of H in the LOOCV strategy for MEM and of C in LOOCV strategy I for BVM

	j=1	j=2	j=3
H_jj	0.46	0.51	0.55
C_jj	0.46	0.51	0.55

Table 4

Prediction errors from different LOOCV strategies (different strategies gave identical prediction errors)

	j=1	j=2	j=3
ê_j	1.13	1.21	–2.66

Table 3

Q matrix in strategy II for BVM

	1	2	3	4
1	8.75	1.97	2.12	–0.62
2	1.97	1,002.40	1,000.80	1,000.80
3	2.12	1,000.80	1,001.70	1,000.30
4	–0.62	1,000.80	1,000.30	1,001.90


Simulation to compare efficiency

Two datasets were simulated using XSim [8], where 1,000 offspring were sampled from random mating of 100 parents for 10 non-overlapping generations, to compare the computational efficiencies of naive and efficient strategies using BVM or MEM for LOOCV in GBLUP. Dataset I was simulated with 1,000 observations and 10,000 SNP markers for a p≫n scenario. Dataset II was simulated with 1,000 observations and 100 markers for an n≫p scenario. The processor used in the analyses was a 1.4 GHz Intel Core i5 with 4 GB of memory. All strategies were implemented in Julia, a scientific programming language, and gave virtually identical prediction accuracies, defined as the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$, for each dataset. For dataset I, efficient BVM strategy I was 786 times faster than naive BVM (3.107 s versus 2,442.590 s) (Table 5). For dataset II, efficient MEM was 99 times faster than naive MEM (0.030 s versus 2.979 s) (Table 5).

Table 5

Efficiency of alternative LOOCV strategies for GBLUP

	Alternative LOOCV strategies
	Naive MEM	Naive BVM	Efficient MEM	Efficient BVM I	Efficient BVM II
n=1,000; p=10,000	9,490.608	2,442.590	105.141	3.107	5.945
n=1,000; p=100	2.979	169.928	0.030	2.725	0.217

Results are given for the computing time in seconds using naive MEM, naive BVM, efficient MEM, efficient BVM I and efficient BVM II


Discussion

In genomic prediction, the candidates to be predicted are often offspring that are genotyped but not yet phenotyped. In this situation, LOOCV using all individuals in the training dataset will provide an upper bound for the accuracy of prediction, because ancestors in the training dataset with large numbers of descendants have more accurate predictions than descendants. A better estimate of the accuracy of prediction can be obtained by applying LOOCV to only terminal offspring in the training dataset.

Conclusions

Efficient strategies for LOOCV in GBLUP are presented in this paper. LOOCV strategies I and II for BVM are more efficient when p≫n, whereas the LOOCV strategy for MEM is more efficient when n≫p. The accuracy of genomic prediction is often quantified as the correlation between the predicted and observed values of y, and this correlation can be estimated efficiently using these LOOCV strategies. Whereas naive application of LOOCV is computationally intensive, the strategies presented here require little more effort than a single analysis.

References

1. Prediction of total genetic value using genome-wide dense marker maps.

Authors: T H Meuwissen; B J Hayes; M E Goddard
Journal: Genetics Date: 2001-04 Impact factor: 4.562

2. Effect of total allelic relationship on accuracy of evaluation and response to selection.

Authors: A Nejati-Javaremi; C Smith; J P Gibson
Journal: J Anim Sci Date: 1997-07 Impact factor: 3.159

3. Technical note: Derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit.

Authors: I Strandén; D J Garrick
Journal: J Dairy Sci Date: 2009-06 Impact factor: 4.034

4. XSim: Simulation of Descendants from Ancestors with Sequence Data.

Authors: Hao Cheng; Dorian Garrick; Rohan Fernando
Journal: G3 (Bethesda) Date: 2015-05-07 Impact factor: 3.154
