Yang Liu, Duolin Wang, Fei He, Juexin Wang, Trupti Joshi, Dong Xu.
Abstract
Genomic selection uses single-nucleotide polymorphisms (SNPs) to predict quantitative phenotypes for enhancing traits in breeding populations and has been widely used to increase breeding efficiency for plants and animals. Existing statistical methods rely on a prior distribution assumption of imputed genotype effects, which may not fit experimental datasets. Emerging deep learning technology could serve as a powerful machine learning tool to predict quantitative phenotypes without imputation and also to discover potentially associated genotype markers efficiently. We propose a deep learning framework using convolutional neural networks (CNNs) to predict quantitative traits from SNPs and to investigate genotype contributions to the trait using saliency maps. Missing SNP values are treated as a new genotype in the input of the deep learning model. We tested our framework on both simulated data and experimental soybean datasets. The results show that the deep learning model can bypass the imputation of missing values and predict quantitative phenotypes more accurately than other well-known statistical methods currently available. It can also effectively and efficiently identify significant SNP markers and SNP combinations associated with traits in genome-wide association studies.
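The idea of treating a missing SNP call as its own genotype can be illustrated with a small one-hot encoder. This is a minimal sketch, not the authors' code: the genotype coding (0, 1, 2 for allele counts, `None` for a missing call) and the function name `one_hot_genotypes` are assumptions made for illustration.

```python
# Assumed coding (not stated in the source): 0, 1, 2 = allele counts, None = missing call.
CATEGORIES = (0, 1, 2, None)

def one_hot_genotypes(genotypes):
    """Return one 4-element indicator vector per SNP; a missing call gets its own slot."""
    index = {g: i for i, g in enumerate(CATEGORIES)}
    encoded = []
    for g in genotypes:
        row = [0] * len(CATEGORIES)
        row[index[g]] = 1  # exactly one category is active per SNP
        encoded.append(row)
    return encoded

sample = [0, 2, None, 1]  # toy individual with one missing call
encoded = one_hot_genotypes(sample)
```

Because the missing call occupies its own column, no imputed value has to be substituted before the matrix is passed to a model.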
Keywords: deep learning; genome-wide association study; genomic selection; genotype contribution; soybean
Year: 2019 PMID: 31824557 PMCID: PMC6883005 DOI: 10.3389/fgene.2019.01091
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Summary of soybean experimental dataset.
| Dataset | Trait | Environment | Sample size | Heritability | Reference |
|---|---|---|---|---|---|
| SoyNAM | Yield | 2013 Illinois | 5,001 | 0.512 | ( |
| | Protein | 2012 Illinois | 5,128 | 0.545 | |
| | Oil | 2012 Illinois | 5,128 | 0.617 | |
| | Moisture | 2012 Illinois | 5,128 | 0.582 | |
| | Height | 2013 Illinois | 5,138 | 0.667 | |
Figure 1. Dual-stream CNN model structure. Genotypes are one-hot coded and passed to the input processing block, which contains two streams of CNNs. The first stacked-CNN stream contains two feed-forward CNN layers with kernel sizes 4 and 20. The second single-CNN stream contains one CNN layer with kernel size 4, followed by an add-up layer to aggregate outputs from the two streams. The feature processing block contains another single convolution layer with kernel size 4. Processed features are then passed to the output processing block, which contains a flatten layer and a fully connected dense layer. CNN, convolutional neural network.
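The layer order described in the caption can be traced with a small NumPy forward pass. This is only a sketch under stated assumptions: the kernel sizes (4, 20, 4, 4), the add-up step, and the flatten plus dense output come from the caption, while the number of filters, the ReLU activation, stride 1, zero "same" padding, and the random weights are assumptions chosen so that the two streams have matching shapes. The names `conv1d_same`, `init_params`, and `dual_cnn_forward` are hypothetical.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution, stride 1, zero 'same' padding (assumed, not from the source).
    x: (length, in_channels); w: (kernel, in_channels, out_channels)."""
    k = w.shape[0]
    left = (k - 1) // 2
    padded = np.pad(x, ((left, k - 1 - left), (0, 0)))
    # windows[l, i] holds the input at position l + i of the padded sequence
    windows = np.stack([padded[i:i + x.shape[0]] for i in range(k)], axis=1)
    return np.einsum("lkc,kco->lo", windows, w) + b

def relu(x):
    return np.maximum(x, 0.0)  # activation is an assumption

def init_params(n_snps, n_filters=8, seed=0):
    rng = np.random.default_rng(seed)
    def conv(k, c_in):
        return rng.normal(0, 0.1, (k, c_in, n_filters)), np.zeros(n_filters)
    return {
        "s1a": conv(4, 4), "s1b": conv(20, n_filters),  # stacked stream: kernels 4 and 20
        "s2": conv(4, 4),                               # single stream: kernel 4
        "feat": conv(4, n_filters),                     # feature block: kernel 4
        "dense": (rng.normal(0, 0.1, n_snps * n_filters), 0.0),
    }

def dual_cnn_forward(x, p):
    stacked = relu(conv1d_same(relu(conv1d_same(x, *p["s1a"])), *p["s1b"]))
    single = relu(conv1d_same(x, *p["s2"]))
    merged = stacked + single                  # add-up layer
    features = relu(conv1d_same(merged, *p["feat"]))
    w, b = p["dense"]
    return float(features.reshape(-1) @ w + b)  # flatten, then dense to one trait value

n_snps = 50
rng = np.random.default_rng(1)
x = np.eye(4)[rng.integers(0, 4, n_snps)]      # toy one-hot genotypes, 4 categories
pred = dual_cnn_forward(x, init_params(n_snps))
```

In practice such a model would be built and trained in a deep learning framework; the explicit forward pass here only makes the data flow between the blocks visible.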
Average Pearson correlation coefficient of five traits from cross-validation.
| Model | Yield | Protein | Oil | Moisture | Height |
|---|---|---|---|---|---|
| dualCNN (imp/non-imp) | 0.434/0.452 | 0.402/0.619 | 0.412/0.668 | 0.426/0.463 | 0.465/0.615 |
| DeepGS (imp/non-imp) | 0.347/0.391 | 0.231/0.506 | 0.344/0.531 | 0.024/0.310 | 0.357/0.452 |
| Dense (imp/non-imp) | 0.359/0.449 | 0.357/0.603 | 0.401/0.657 | 0.370/0.427 | 0.434/0.612 |
| singleCNN (imp/non-imp) | 0.422/0.463 | 0.380/0.573 | 0.392/0.627 | 0.370/0.449 | 0.442/0.565 |
| rrBLUP | 0.412 | 0.392 | 0.39 | 0.413 | 0.458 |
| BRR | 0.422 | 0.392 | 0.39 | 0.413 | 0.458 |
| Bayes A | 0.419 | 0.393 | 0.388 | 0.415 | 0.458 |
| Bayesian LASSO | 0.419 | 0.394 | 0.388 | 0.416 | 0.458 |
CNN, convolutional neural network; BRR, Bayesian ridge regression.
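The metric reported in the table, the Pearson correlation coefficient averaged over cross-validation folds, can be computed as in the following sketch. The function names and the toy fold data are illustrative assumptions; the number of folds used in the study is not given here.

```python
from math import sqrt

def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted phenotype values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sd_o = sqrt(sum((o - mean_o) ** 2 for o in observed))
    sd_p = sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (sd_o * sd_p)

def mean_cv_pearson(folds):
    """Average the per-fold correlations; folds is a list of (observed, predicted) pairs."""
    return sum(pearson_r(o, p) for o, p in folds) / len(folds)

# Toy folds (made-up numbers): one perfectly correlated, one perfectly anti-correlated.
toy_folds = [([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
             ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])]
average_r = mean_cv_pearson(toy_folds)
```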
Figure 2. Training loss of different deep learning models. The x-axis is the number of epochs; the y-axis is the mean absolute error (MAE) loss on the validation dataset. The singleCNN (purple), dualCNN (blue), and Dense (green) networks are conserved, DeepGS overfits after 20 epochs, and our dualCNN has the lowest training loss. CNN, convolutional neural network.
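For reference, the MAE loss plotted in the figure and a simple way to locate the epoch after which validation loss stops improving can be sketched as follows. The helper `best_epoch` and the loss values are hypothetical; they are not taken from the study.

```python
def mean_absolute_error(observed, predicted):
    """Mean of absolute differences between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def best_epoch(validation_losses):
    """1-based epoch with the lowest validation loss; a loss that keeps rising
    after this epoch is a common sign of overfitting."""
    lowest = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return lowest + 1

mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
epoch = best_epoch([0.9, 0.7, 0.6, 0.65, 0.8])  # made-up loss curve
```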
Figure 3. Average Pearson correlation coefficient of five traits using different sizes of training dataset. The x-axis is the number of folds of training data; the y-axis is the average Pearson correlation coefficient from cross-validation.
Figure 4. Comparison of genotype contributions from the saliency map and the GWAS Wald test for the simulation dataset (A) and the experimental soybean dataset with five traits (B–F). The x-axis is the index of SNPs in the genotype matrix; the y-axis shows the saliency and Wald test results. Top-ranked SNPs are plotted in red. GWAS, genome-wide association study; SNP, single-nucleotide polymorphism.
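A saliency map scores each input by the magnitude of the gradient of the model output with respect to that input. The sketch below approximates the gradient with central finite differences so that it runs with NumPy alone; a deep learning framework would normally obtain it by automatic differentiation. Taking the maximum absolute gradient across the one-hot channels of each SNP is an assumed aggregation, and the linear toy model and its weights are made up for illustration.

```python
import numpy as np

def saliency_map(predict, x, eps=1e-4):
    """Approximate |d predict / d x| by central differences; return one score per SNP."""
    grad = np.zeros_like(x, dtype=float)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x, dtype=float)
        step[idx] = eps
        grad[idx] = (predict(x + step) - predict(x - step)) / (2 * eps)
    # Aggregate over the one-hot channels of each SNP (assumed choice: max magnitude)
    return np.abs(grad).max(axis=1)

# Toy linear model (hypothetical): SNP 2 has the largest effect, SNP 4 a small one.
weights = np.zeros((5, 4))
weights[2] = [0.0, 0.5, 1.5, 0.0]
weights[4] = [0.2, 0.0, 0.0, 0.0]
predict = lambda g: float((g * weights).sum())

x = np.eye(4)[[0, 1, 2, 3, 0]]   # one-hot genotypes for five SNPs
scores = saliency_map(predict, x)
top_snp = int(scores.argmax())    # index of the highest-saliency SNP
```

Ranking SNPs by `scores` gives the kind of per-marker profile that the figure compares against the Wald test results.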