Dan He, Zhanyong Wang, Laxmi Parida.
Abstract
MOTIVATION: Given a set of biallelic molecular markers, such as SNPs, with genotype values on a collection of plant, animal or human samples, the goal of quantitative genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Quantitative genetic trait prediction is usually formulated as a linear regression model, which requires quantitative encodings for the genotypes: the three distinct genotype values, corresponding to the heterozygous state and the two homozygous states, are usually coded as integers and manipulated algebraically in the model. Further, epistasis between multiple markers is modeled as multiplication between the markers, and it is unclear whether the regression model remains effective under this treatment. In this work we investigate the effects of encodings on the quantitative genetic trait prediction problem.
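To make the concern about multiplicative epistasis concrete, the following minimal sketch tabulates the product term for every pair of genotypes under the two integer encodings compared later in this record, {0, 1, 2} and {−1, 0, 1}. The names `GENOTYPES`, `ENCODINGS` and `interaction_table` are illustrative assumptions, not identifiers from the paper, and indexing genotypes by allele count is likewise an assumption made for the example.

```python
from itertools import product

# Assumption: genotypes are indexed by allele count, so 0 and 2 are the
# two homozygotes and 1 is the heterozygote.
GENOTYPES = (0, 1, 2)

# Two integer encodings of the same three genotypes.
ENCODINGS = {
    "{0, 1, 2}": {0: 0, 1: 1, 2: 2},
    "{-1, 0, 1}": {0: -1, 1: 0, 2: 1},
}


def interaction_table(encoding):
    """Return the product (epistasis) term for every genotype pair."""
    return {
        (g1, g2): encoding[g1] * encoding[g2]
        for g1, g2 in product(GENOTYPES, repeat=2)
    }


tables = {name: interaction_table(enc) for name, enc in ENCODINGS.items()}

# Print each table as a 3x3 grid: rows are marker 1, columns are marker 2.
for name, table in tables.items():
    print(name)
    for g1 in GENOTYPES:
        print("  ", [table[(g1, g2)] for g2 in GENOTYPES])
```

Under {0, 1, 2} the two double homozygotes receive product values 0 and 4, whereas under {−1, 0, 1} both receive 1 and every pair involving a heterozygote receives 0. The interaction feature therefore groups genotype pairs differently depending on the encoding, which is one way to see why the choice of encoding can matter for the epistasis model.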
Year: 2015 PMID: 25707435 PMCID: PMC4571493 DOI: 10.1186/1471-2105-16-S1-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The data-driven encoding for pairwise epistasis of markers.
The correlation between the top MI (mutual information) ranks under different genotype encodings.
| Genotype Encoding | E1 | E2 | E3 | E4 |
|---|---|---|---|---|
| E1 | 1 | 0.003 | 0.001 | 0.001 |
| E2 | 0.001 | 1 | 0.001 | 0.004 |
| E3 | 0.001 | 0.001 | 1 | 0.001 |
| E4 | 0.003 | 0.003 | - | 1 |
The r² of the predicted trait values under the encoding sets {0, 1, 2} and {−1, 0, 1} for the epistasis model (Formula 2) on the Dent data set.
| Dataset | {0, 1, 2} | {−1, 0, 1} |
|---|---|---|
| Dent 1 Tass | 0.59 | 0.457 |
| Dent 2 DMC | 0.562 | 0.481 |
| Dent 3 DM Yield | 0.321 | 0.211 |
Performance (average r² over 10 randomly simulated data sets) of rrBLUP for different genotype encodings and different numbers of contributing genotypes s.
| s | Traditional Encoding | Pure Data-driven Encoding | Hybrid Data-driven Encoding |
|---|---|---|---|
| 5 | 0.1095 | 0.0239 | |
| 10 | 0.0569 | 0.0512 | |
| 20 | 0.1841 | 0.1334 | |
| 50 | 0.0151 | 0.1108 | |
| 100 | 0.1420 | 0.2147 | |
| 200 | 0.1267 | 0.2073 | |
Performance of rrBLUP (average r²) on the traits of four real data sets under the traditional encoding vs. the hybrid data-driven encoding.
| Data Set | Traditional Encoding | Hybrid Data-driven Encoding | Improvement |
|---|---|---|---|
| Rice: Pericarp.color | 0.433 | 0.504 | 16.4% |
| Rice: Protein.content | 0.176 | 0.177 | 0.6% |
| Pig: Trait 2 | 0.237 | 0.239 | 0.8% |
| Pig: Trait 4 | 0.203 | 0.218 | 7.4% |
| QTLMAS: Trait 1 | 0.358 | 0.361 | 0.8% |
| QTLMAS: Trait 2 | 0.187 | 0.18 | -3.7% |
| Maize: Flint 1 TASS | 0.47 | 0.492 | 4.7% |
| Maize: Flint 2 DMC | 0.301 | 0.308 | 2.3% |
| Maize: Flint 3 DM_Yield | 0.057 | 0.068 | 19.3% |
| Maize: Dent 1 Tass | 0.59 | 0.616 | 4.4% |
| Maize: Dent 2 DMC | 0.562 | 0.58 | 3.2% |
| Maize: Dent 3 DM_Yield | 0.321 | 0.349 | 8.7% |
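The Improvement column appears to be the relative change of the hybrid data-driven r2 over the traditional r2, expressed as a percentage. That definition is an assumption inferred from the tabulated values, not something stated in this record. A minimal sketch that recomputes it for three rows of the table (the function name `relative_improvement` is hypothetical):

```python
def relative_improvement(traditional, hybrid):
    """Percentage change of the hybrid r2 relative to the traditional r2."""
    return (hybrid - traditional) / traditional * 100.0


# (traditional r2, hybrid r2) pairs copied from the table above.
rows = {
    "Rice: Pericarp.color": (0.433, 0.504),
    "QTLMAS: Trait 2": (0.187, 0.18),
    "Maize: Flint 3 DM_Yield": (0.057, 0.068),
}

# Round to one decimal place, matching the precision used in the table.
improvements = {
    name: round(relative_improvement(trad, hyb), 1)
    for name, (trad, hyb) in rows.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct}%")
```

Because the change is measured relative to the traditional r2, a small absolute gain on a weakly predicted trait (Flint 3 DM_Yield, 0.057 to 0.068) yields a large percentage, so the column is best read alongside the absolute values.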
Performance (average r²) of rrBLUP for the single marker model (Formula 1) and epistasis model (Formula 2) under the traditional encoding vs. the data-driven encoding.
| Phenotype | rrBLUP (T) | rrBLUP (D) | Epistasis (T) | Epistasis (D) |
|---|---|---|---|---|
| Dent | | | | |
| 1 | 0.590 | 0.616 | 0.590 | |
| 2 | 0.552 | 0.58 | 0.552 | |
| 3 | 0.321 | 0.349 | 0.349 | |
| Flint | | | | |
| 1 | 0.470 | 0.492 | 0.476 | |
| 2 | 0.301 | 0.308 | 0.312 | |
| 3 | 0.057 | 0.068 | 0.096 | |
rrBLUP (T) is the single marker model under the traditional encoding. rrBLUP (D) is the single marker model under the data-driven encoding. Epistasis (T) is the epistasis model under the traditional encoding. Epistasis (D) is the epistasis model under the data-driven encoding.