Vinícius Silva Junqueira, Daniela Lourenco, Yutaka Masuda, Fernando Flores Cardoso, Paulo Sávio Lopes, Fabyano Fonseca E Silva, Ignacy Misztal.
Abstract
Efficient computing techniques allow the estimation of variance components for virtually any traditional dataset. When genomic information is available, variance components can be estimated using genomic REML (GREML). If only a portion of the animals have genotypes, single-step GREML (ssGREML) is the method of choice. The genomic relationship matrix (G) used in both cases is dense, so computations are limited by the number of genotyped animals. The algorithm for proven and young (APY) can be used to create a sparse inverse of G ($G_{APY}^{-1}$) with close to linear memory and computing requirements. In ssGREML, the inverse of the realized relationship matrix ($H^{-1}$) also includes the inverse of the pedigree relationship matrix, which can be dense with a long pedigree but sparser with a short one. The main purpose of this study was to investigate whether the costs of ssGREML can be reduced using APY with truncated pedigree and phenotypes. We also investigated the impact of truncation on variance component estimation when different numbers of core animals are used in APY. Simulations included 150K animals from 10 generations, with selection. Phenotypes ($h^2$ = 0.3) were available for all animals in generations 1-9. A total of 30K animals in generations 8 and 9, and 15K validation animals in generation 10, were genotyped for 52,890 SNP. Average information REML and ssGREML with $G^{-1}$ and $G_{APY}^{-1}$ using 1K, 5K, 9K, and 14K core animals were compared. Variance components are impacted when the core group in APY represents the number of eigenvalues explaining a small fraction of the total variation in G. The most time-consuming operation was the inversion of G, which accounted for more than 50% of the total time. Next, numerical factorization consumed nearly 30% of the total computing time. On average, a 7% decrease in the computing time for ordering was observed by removing each generation of data.
APY can be successfully applied to create the inverse of the genomic relationship matrix used in ssGREML for estimating variance components. To ensure reliable variance component estimation, it is important to use a core size that corresponds to the number of largest eigenvalues explaining around 98% of the total variation in G. When APY is used, pedigrees can be truncated to increase the sparsity of H and slightly reduce computing time for ordering and symbolic factorization, with no impact on the estimates.
Keywords: genomic information; old data; sparse genomic matrix; variance components
Year: 2022 PMID: 35289906 PMCID: PMC9118993 DOI: 10.1093/jas/skac082
Source DB: PubMed Journal: J Anim Sci ISSN: 0021-8812 Impact factor: 3.338
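The abstract describes APY as a way to build a sparse inverse of G from a core group. A minimal NumPy sketch of the commonly cited APY block formula follows; the function name `apy_inverse` and the dense output array are assumptions made for illustration, not the software used in the study. The non-core block of the result is diagonal, which is where the sparsity comes from.

```python
import numpy as np

def apy_inverse(G, core_idx):
    """Sketch of the APY inverse of a genomic relationship matrix G.

    Assumes G is symmetric and the core block is invertible.
    Returned dense for readability; in practice only the core block,
    the core-by-noncore block, and a diagonal would be stored.
    """
    n = G.shape[0]
    core = np.asarray(core_idx)
    noncore = np.setdiff1d(np.arange(n), core)

    Gcc_inv = np.linalg.inv(G[np.ix_(core, core)])
    Gcn = G[np.ix_(core, noncore)]
    P = Gcc_inv @ Gcn  # regression of noncore animals on core animals

    # Diagonal of Gnn - Gnc Gcc^-1 Gcn (off-diagonals are ignored by APY)
    m = G[noncore, noncore] - np.einsum("ij,ij->j", Gcn, P)
    m_inv = 1.0 / m

    out = np.zeros((n, n))
    out[np.ix_(core, core)] = Gcc_inv + (P * m_inv) @ P.T
    out[np.ix_(core, noncore)] = -P * m_inv
    out[np.ix_(noncore, core)] = (-P * m_inv).T
    out[noncore, noncore] = m_inv  # diagonal noncore block
    return out
```

When the non-core animals are conditionally independent given the core animals, this matches the regular inverse; otherwise it is an approximation whose quality depends on the choice and size of the core group.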
Standard deviation of variance components and heritability calculated across generations using complete (Full) mixed model equations (MME) and reduced MME after skipping zero elements (Reduced)
| Parameter | Core | Full | Reduced |
|---|---|---|---|
| $\sigma_a^2$ | eigen70 | 0.037 | 0.037 |
| | eigen90 | 0.011 | 0.013 |
| | eigen95 | 0.008 | 0.008 |
| | eigen98 | 0.005 | 0.005 |
| $\sigma_e^2$ | eigen70 | 0.028 | 0.028 |
| | eigen90 | 0.007 | 0.007 |
| | eigen95 | 0.005 | 0.005 |
| | eigen98 | 0.000 | 0.004 |
| $h^2$ | eigen70 | 0.032 | 0.032 |
| | eigen90 | 0.011 | 0.011 |
| | eigen95 | 0.005 | 0.005 |
| | eigen98 | 0.005 | 0.005 |

$\sigma_a^2$, additive variance; $\sigma_e^2$, residual variance; $h^2$, heritability.
Scenarios tested the allocation of 1K, 5K, 9K, and 14K individuals to the core group to explain 72.03% (eigen70), 91.09% (eigen90), 95.70% (eigen95), and 98.07% (eigen98) of the variation in the genomic matrix (G), respectively, using the Algorithm for Proven and Young (APY).
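The link between core size and the share of variation in G explained by the largest eigenvalues can be sketched as follows. This is an illustrative helper under stated assumptions: the name `core_size_for_variance` is hypothetical, and a full eigendecomposition is used for clarity even though it would be costly for tens of thousands of genotyped animals.

```python
import numpy as np

def core_size_for_variance(G, fraction):
    """Number of largest eigenvalues of G whose sum reaches `fraction`
    of the total variation (the sum of all eigenvalues).

    Hypothetical helper for illustration; assumes G is symmetric.
    """
    eigvals = np.linalg.eigvalsh(G)[::-1]       # sort descending
    eigvals = np.clip(eigvals, 0.0, None)       # guard against tiny negatives
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, len(eigvals))
```

Calling it with fractions such as 0.70, 0.90, 0.95, and 0.98 on a given G would yield candidate core sizes analogous to the eigen70 to eigen98 scenarios.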
Figure 4. Distribution of the variance ratio ($\sigma_e^2/\sigma_a^2$) across different numbers of generations (i.e., 10, 9, 8, 7, 6, 5, and 4) of pedigree and phenotypic data, using different sizes for the core group in the Algorithm for Proven and Young (APY). Two scenarios were considered, where zeros were stored (Full) or not (Reduced). Error bars extend from the hinge to the largest (smallest) value no further than 1.5 times the distance between the first and third quartiles. Ratios were estimated under restricted maximum likelihood (REML). Scenarios tested the allocation of 1K, 5K, 9K, and 14K individuals to the core group to explain 72.03% (eigen70), 91.09% (eigen90), 95.70% (eigen95), and 98.07% (eigen98) of the variation in the genomic matrix (G), respectively.
Figure 5. Average share of total time (%) spent in each operation of the matrix inversion process. Averages and error bars (standard deviation) were calculated across scenarios using different numbers of generations of pedigree and phenotypic data and different core sizes. The x-axis represents the steps required to invert matrices: finding the ordering, symbolic factorization (Symbolic Fact., setting up the data structure), numerical factorization (Numerical Fact.), and sparse inversion. Two scenarios were considered, where zeros were stored (Full) or not (Reduced).
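To illustrate why the ordering step matters before factorization, the sketch below counts the fill-in (new nonzeros) that symbolic elimination would create for a given elimination order on a symmetric sparsity pattern. It is a toy graph-based model, not the sparse package used in the study; `symbolic_fill_in` and the adjacency-dictionary input are assumptions made for this example.

```python
def symbolic_fill_in(adjacency, order):
    """Count fill-in edges created when eliminating nodes in `order`.

    `adjacency` maps each node to the set of nodes it shares an
    off-diagonal nonzero with (symmetric pattern). Eliminating a node
    connects all of its remaining neighbours to each other; every new
    connection is one fill-in entry in the factor.
    """
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        nbrs = list(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill

# "Arrow" pattern: node 0 is linked to every other node.
arrow = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
hub_first = symbolic_fill_in(arrow, [0, 1, 2, 3, 4])
hub_last = symbolic_fill_in(arrow, [1, 2, 3, 4, 0])
```

Eliminating the highly connected node first makes the remaining pattern dense, whereas eliminating it last creates no fill-in, which is the kind of saving an ordering algorithm looks for.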
Figure 6. Timing (in seconds) of each operation used to invert matrices, using different numbers of generations of pedigree and phenotypes and different numbers of core animals in the computation of $G^{-1}$ with the Algorithm for Proven and Young (APY). Matrix inversion steps: finding the ordering (Ordering), symbolic factorization (Symbolic Fact.), numerical factorization (Numerical Fact.), and sparse inversion. Two scenarios were considered, where zeros were stored (Full) or not (Reduced). Scenarios tested the allocation of 1K, 5K, 9K, and 14K individuals to the core group to explain 72.03% (eigen70), 91.09% (eigen90), 95.70% (eigen95), and 98.07% (eigen98) of the variation in the genomic matrix (G), respectively.
Descriptive statistics of computing time savings for the matrix operations and the slope of a regression of computing time on generations after removing pedigree and phenotypic data
| Core | Operation | Full: Min (%) | Full: Mean (%) | Full: Max (%) | Full: SD (%) | Full: Slope | Full: Sig. | Reduced: Min (%) | Reduced: Mean (%) | Reduced: Max (%) | Reduced: SD (%) | Reduced: Slope | Reduced: Sig. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eigen70 | Ordering | 1.16 | 24.58 | 50.94 | 16.98 | −0.07 | ** | 9.10 | 23.95 | 52.47 | 16.99 | −0.08 | ** |
| | Symbolic Factorization | 0.61 | 4.77 | 16.22 | 6.04 | −0.02 | ** | 0.08 | 4.57 | 18.44 | 6.92 | −0.02 | * |
| | Numerical Factorization | 2.38 | 7.47 | 16.92 | 4.93 | −0.02 | ns | 3.36 | 6.57 | 16.32 | 4.97 | −0.02 | * |
| | Sparse Inversion | 4.67 | 8.08 | 18.25 | 5.08 | −0.02 | ** | 0.59 | 5.80 | 16.47 | 5.71 | −0.01 | ns |
| eigen90 | Ordering | 21.35 | 32.13 | 42.88 | 8.60 | −0.06 | ** | 13.61 | 26.58 | 42.48 | 11.19 | −0.07 | ** |
| | Symbolic Factorization | 4.98 | 9.24 | 13.06 | 3.08 | −0.02 | ** | 3.10 | 5.50 | 9.97 | 2.89 | −0.01 | ** |
| | Numerical Factorization | 2.13 | 6.21 | 11.34 | 3.47 | −0.00 | ns | 3.22 | 6.17 | 8.41 | 2.22 | −0.00 | ** |
| | Sparse Inversion | 4.39 | 6.81 | 8.52 | 1.48 | −0.00 | ns | 2.52 | 5.33 | 8.51 | 2.51 | −0.00 | ns |
| eigen95 | Ordering | 7.94 | 21.40 | 36.80 | 11.13 | −0.06 | ** | 6.65 | 24.55 | 41.48 | 13.99 | −0.07 | ns |
| | Symbolic Factorization | 2.76 | 6.39 | 10.33 | 2.97 | −0.01 | ** | 2.48 | 5.19 | 7.43 | 2.20 | −0.01 | ns |
| | Numerical Factorization | 4.96 | 7.26 | 9.98 | 1.82 | −0.00 | ns | 3.45 | 7.63 | 10.33 | 2.93 | −0.00 | ns |
| | Sparse Inversion | 0.07 | 5.52 | 9.08 | 3.38 | −0.00 | ns | 3.77 | 6.59 | 10.14 | 2.82 | −0.00 | ns |
| eigen98 | Ordering | 25.04 | 38.19 | 49.13 | 9.85 | −0.07 | ** | 15.79 | 30.57 | 41.97 | 11.27 | −0.07 | ** |
| | Symbolic Factorization | 8.28 | 16.33 | 19.63 | 4.61 | −0.03 | ** | 1.26 | 5.79 | 9.71 | 3.72 | −0.02 | ** |
| | Numerical Factorization | 5.92 | 10.15 | 13.99 | 2.89 | −0.00 | ns | 4.32 | 7.91 | 11.39 | 2.35 | −0.00 | ns |
| | Sparse Inversion | 5.34 | 8.09 | 10.32 | 2.14 | −0.01 | ns | 2.85 | 5.88 | 9.32 | 2.25 | −0.00 | ns |

The benchmark is the model using full pedigree and phenotypic data. Comparisons are made for core groups of different sizes in the algorithm for proven and young (APY), using either the full mixed model equations (Full) or the reduced mixed model equations after skipping zero elements (Reduced).
Scenarios tested the allocation of 1K, 5K, 9K, and 14K individuals to the core group to explain 72.03% (eigen70), 91.09% (eigen90), 95.70% (eigen95), and 98.07% (eigen98) of the variation in the genomic matrix (G), respectively.
SD, standard deviation.
Slope, slope of a regression of computing time on generations.
Slope statistical significance: *P < 0.05, **P < 0.10; ns, not significant.
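The two summary quantities in this table, the percentage saving relative to the full-data benchmark and the slope of a regression of computing time on generations, can be computed along the following lines. This is a plain-Python sketch: the function names are hypothetical, any timings passed in would be the reader's own measurements, and the units and scaling of the published slopes are not stated in this record.

```python
def saving_percent(t_benchmark, t_truncated):
    """Percentage of computing time saved relative to the benchmark run."""
    return 100.0 * (t_benchmark - t_truncated) / t_benchmark

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx
```

For example, `x` could hold the number of generations removed in each scenario and `y` the measured time for one operation, so that the sign of the slope shows whether truncation reduces that operation's cost.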