Shuai Zeng, Ziting Mao, Yijie Ren, Duolin Wang, Dong Xu, Trupti Joshi.
Abstract
G2PDeep is an open-access web server that provides a deep-learning framework for quantitative phenotype prediction and discovery of genomic markers. It uses zygosity or single nucleotide polymorphism (SNP) information from plants and animals as input to predict a quantitative phenotype of interest and the genomic markers associated with that phenotype. It provides a one-stop-shop platform for researchers to create deep-learning models through an interactive web interface and to train these models with uploaded data, using high-performance computing resources plugged in at the backend. G2PDeep also provides a series of informative interfaces to monitor the training process and compare performance among the trained models. The trained models can then be deployed automatically. The quantitative phenotype and genomic markers are predicted using a user-selected trained model, and the results are visualized. Our state-of-the-art model has been benchmarked against models from other researchers and has demonstrated competitive performance in quantitative phenotype prediction. In addition, the server integrates the soybean nested association mapping (SoyNAM) dataset with five phenotypes: grain yield, height, moisture, oil, and protein. A publicly available dataset for seed protein and oil content has also been integrated into the server. The G2PDeep server is publicly available at http://g2pdeep.org. The Python-based deep-learning model is available at https://github.com/shuaizengMU/G2PDeep_model.
Year: 2021 PMID: 34037802 PMCID: PMC8262736 DOI: 10.1093/nar/gkab407
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. (A) Architecture of the dual-stream CNN model. The genotypes are one-hot encoded. The left stream consists of two CNN layers with kernel sizes of 4 and 20, respectively, each with 10 filters. The right stream is a single CNN layer with a kernel size of 4 and 10 filters. The add-up layer aggregates the outputs of the two streams and is followed by a single CNN layer with a kernel size of 4 and 10 filters. The fully connected layers, with 512 and 1 neurons, form the regression block that predicts the quantitative phenotype. (B) Flowchart of genomic marker discovery using a well-trained model and a saliency map. The test dataset is used to estimate marker significance. For each sample in the test dataset, the saliency map and the well-trained model are used to estimate saliency values. The marker significance is the mean saliency value at each marker position.
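The layer arrangement in panel A can be traced with a small NumPy forward pass. This is only a sketch of the data flow under stated assumptions, not the authors' implementation: the four genotype states, the "same" zero padding, the ReLU activations, the stride of 1 and the random weights are all assumptions, and the function names are hypothetical. The kernel sizes, filter counts and dense-layer widths are the ones given in the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(genotypes, n_states=4):
    """Integer codes (n_samples, n_markers) -> one-hot (n_samples, n_markers, n_states).
    The number of genotype states is an assumption."""
    return np.eye(n_states)[genotypes]

def conv1d(x, kernel_size, n_filters=10):
    """1D convolution along the marker axis. 'Same' padding, ReLU and random
    weights are assumptions made only for this sketch."""
    w = rng.normal(scale=0.1, size=(kernel_size, x.shape[2], n_filters))
    left = (kernel_size - 1) // 2
    padded = np.pad(x, ((0, 0), (left, kernel_size - 1 - left), (0, 0)))
    # windows has shape (n_samples, n_markers, channels, kernel_size)
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size, axis=1)
    return np.maximum(np.einsum("nlck,kcf->nlf", windows, w), 0.0)

def dense(x, n_units, relu=True):
    """Fully connected layer with random weights (bias omitted for brevity)."""
    out = x @ rng.normal(scale=0.01, size=(x.shape[1], n_units))
    return np.maximum(out, 0.0) if relu else out

def dual_cnn_forward(genotypes):
    x = one_hot(genotypes)
    left = conv1d(conv1d(x, kernel_size=4), kernel_size=20)  # left stream: two CNN layers
    right = conv1d(x, kernel_size=4)                          # right stream: one CNN layer
    merged = conv1d(left + right, kernel_size=4)              # add-up layer, then one CNN layer
    flat = merged.reshape(len(merged), -1)
    return dense(dense(flat, 512), 1, relu=False)             # regression block: 512 -> 1

# Toy input: 8 samples, 200 markers, genotype codes 0..3
genotypes = rng.integers(0, 4, size=(8, 200))
predictions = dual_cnn_forward(genotypes)
print(predictions.shape)
```

Because the two streams are summed element-wise, they must produce outputs of identical shape, which is why this sketch pads every convolution to preserve the marker axis length.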
Figure 2. Illustration of the G2PDeep architecture. The architecture consists of four modules, which communicate with each other via appropriate APIs.
Pearson correlation coefficient of models on the five phenotype datasets from the SoyNAM dataset. dualCNN is the model currently used in G2PDeep. DeepGS is a model combining a CNN and a fully connected neural network. rrBLUP is ridge regression with a relationship matrix and Gaussian kernel. BRR is Bayesian ridge regression. Bayesian LASSO is Bayesian regression with an L1 penalty.
| Model | Yield | Protein | Oil | Moisture | Height |
|---|---|---|---|---|---|
| dualCNN | | | | | |
| DeepGS | 0.391 | 0.506 | 0.531 | 0.310 | 0.452 |
| rrBLUP | 0.412 | 0.392 | 0.39 | 0.413 | 0.458 |
| BRR | 0.422 | 0.392 | 0.39 | 0.413 | 0.458 |
| Bayesian LASSO | 0.419 | 0.394 | 0.388 | 0.416 | 0.458 |
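For reference, the Pearson correlation coefficient reported in the table measures the linear agreement between observed and predicted phenotype values. A minimal pure-Python sketch follows; the function name and the example values are hypothetical and are not taken from the benchmark.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted phenotype values."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)

# Hypothetical observed vs. predicted values for five samples
observed = [41.2, 39.8, 43.5, 40.1, 42.7]
predicted = [40.9, 40.2, 42.8, 40.6, 42.1]
print(round(pearson_r(observed, predicted), 3))
```

The coefficient is insensitive to scale and offset, so a model can score well here while still being biased; error-based metrics are needed to detect that.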
Pearson correlation coefficient of models on the two phenotype datasets from Bandillo's dataset. dualCNN is the model currently used in G2PDeep. DeepGS is a model combining a CNN and a fully connected neural network. rrBLUP is ridge regression with a relationship matrix and Gaussian kernel. BRR is Bayesian ridge regression. Bayesian LASSO is Bayesian regression with an L1 penalty.
| Model | Protein | Oil |
|---|---|---|
| dualCNN | | |
| DeepGS | 0.453 | 0.543 |
| rrBLUP | 0.434 | 0.533 |
| BRR | 0.443 | 0.521 |
| Bayesian LASSO | 0.412 | 0.534 |
Figure 3. Dataset creation and retrieval in G2PDeep. (A) An example of uploading a file via a shared link to the data. Both the dataset name and the link are required. (B) The uploaded and publicly available datasets are shown with their metadata. (C) Details of the dataset, including the number of features and the number of SNPs in the training and validation datasets.
Figure 4. Project section in G2PDeep. (A) Interactive chart to configure the deep-learning model. (B) Learning curve showing losses and metrics for each epoch, and a scatter plot of predicted versus true quantitative phenotypes for the training and validation datasets. (C) An example of comparison results between two models. The Pearson correlation coefficient, R-squared, mean absolute error and mean squared error are shown in a table.
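The error-based metrics in the panel C comparison table can be sketched as follows. This assumes R-squared means the coefficient of determination (1 minus the ratio of residual to total sum of squares); the server may define it differently. The function name, model names and values are hypothetical.

```python
def regression_metrics(y_true, y_pred):
    """R-squared (coefficient of determination, assumed), MAE and MSE."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    return {
        "r_squared": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "mse": ss_res / n,
    }

# Hypothetical validation targets and predictions from two trained models
y_val = [3.0, 5.0, 7.0, 9.0]
model_predictions = {
    "model_a": [2.8, 5.3, 6.9, 9.4],
    "model_b": [4.0, 4.0, 8.0, 8.0],
}

# Side-by-side comparison, one row per model
print(f"{'model':<10}{'R2':>8}{'MAE':>8}{'MSE':>8}")
for name, y_hat in model_predictions.items():
    m = regression_metrics(y_val, y_hat)
    print(f"{name:<10}{m['r_squared']:>8.3f}{m['mae']:>8.3f}{m['mse']:>8.3f}")
```

Comparing models is only meaningful when every row is computed on the same validation samples, as it is here.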
Figure 5. Results of predicted quantitative phenotypes and marker significance. The saliency map shows that the marker most strongly associated with the phenotype is located at approximately SNP index 3000.
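The marker significance plotted here is the per-marker mean of saliency values over the test samples. The sketch below illustrates that averaging step under loud assumptions: a toy linear function stands in for the trained network, saliency is approximated by central finite differences rather than by backpropagated gradients, and the absolute value is taken before averaging. All names and numbers are hypothetical.

```python
def saliency(model, x, eps=1e-4):
    """Absolute sensitivity of model(x) to each marker value.
    Finite differences are a stand-in for the gradient a deep-learning
    framework would compute by backpropagation."""
    values = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        values.append(abs(model(up) - model(down)) / (2 * eps))
    return values

def marker_significance(model, test_samples):
    """Mean saliency value at each marker position across all test samples."""
    per_sample = [saliency(model, sample) for sample in test_samples]
    n = len(per_sample)
    return [sum(column) / n for column in zip(*per_sample)]

# Toy stand-in for a trained model: marker 2 has by far the largest effect.
effects = [0.0, 0.1, 2.0, 0.0, 0.3]

def toy_model(x):
    return sum(w * v for w, v in zip(effects, x))

test_samples = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0, 1.0, 2.0],
    [1.0, 1.0, 0.0, 2.0, 1.0],
]
significance = marker_significance(toy_model, test_samples)
top_marker = max(range(len(significance)), key=significance.__getitem__)
print(top_marker)
```

Plotting `significance` against the marker index would give a profile of the kind described in the caption, with peaks at the positions the model relies on most.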