Nicholas Dominic, Tjeng Wawan Cenggoro, Arif Budiarto, Bens Pardamean.
Abstract
As the fourth most populous country in the world, Indonesia must increase its annual rice production rate to achieve national food security by 2050. One possible solution comes from the nanoscopic level: a genetic variant called the Single Nucleotide Polymorphism (SNP), which can express significant yield-associated genes. The prior benchmark for this study used a statistical genetics model that involved neither SNP position information nor an attention mechanism. Hence, we developed a novel deep polygenic neural network, named the NucleoNet model, to address these limitations. The NucleoNets were constructed by combining several prominent components: positional SNP encoding, the context vector, wide models, Elastic Net, and Shannon's entropy loss. This polygenic modeling achieved a Mean Squared Error (MSE) as low as 2.779 with a Symmetric Mean Absolute Percentage Error (SMAPE) of 47.156%, while revealing 15 new important SNPs. Furthermore, the NucleoNets reduced the MSE by up to 32.28% compared to the Ordinary Least Squares (OLS) model. Through the ablation study, we learned that combining the Xavier distribution for weight initialization with the Normal distribution for bias initialization yielded a more diverse set of important SNPs across the 12 chromosomes. Our findings confirmed that the NucleoNet model successfully outperformed the OLS model and identified SNPs important to Indonesian rice yields.
Year: 2022 PMID: 35970979 PMCID: PMC9378700 DOI: 10.1038/s41598-022-16075-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Number of SNPs for each chromosome.
Figure 2. Data preprocessing step.
Figure 3. The NucleoNet model.
The prominent parts of the NucleoNet model.
| No. | Component in model | Purpose |
|---|---|---|
| 1 | Positional encoding | Add SNP position information to the primary SNP data |
| 2 | The context vector | As the attention mechanism, to emit the SNP importance value |
| 3 | Wide model | Accommodate all covariates |
| 4 | Elastic net | Penalize all parameters in all layers |
| 5 | Entropy loss | Control the distribution of attention scores across all SNPs |
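The roles of the context vector, the entropy loss, and the Elastic Net in the table above can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product-plus-softmax scoring, the toy inputs, and the penalty coefficients are all assumptions made for illustration, and the plain Elastic Net shown here may differ from the modified variant the NucleoNets use.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(snp_hidden, context_vector):
    """One score per SNP: dot product with the context vector, then softmax."""
    logits = [sum(h * c for h, c in zip(vec, context_vector)) for vec in snp_hidden]
    return softmax(logits)

def shannon_entropy(scores):
    """Entropy (in nats) of an attention distribution; 0 * log(0) counts as 0."""
    return -sum(p * math.log(p) for p in scores if p > 0.0)

def elastic_net_penalty(weights, l1=0.01, l2=0.01):
    """Plain Elastic Net: l1 * sum(|w|) + l2 * sum(w^2). Coefficients are assumed."""
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)

# Hypothetical toy input: 4 SNPs, each with a 2-dimensional hidden representation.
snp_hidden = [[0.2, 0.1], [1.5, 0.3], [0.0, 0.0], [0.4, 0.9]]
context_vector = [1.0, 0.5]

scores = attention_scores(snp_hidden, context_vector)   # SNP importance values
entropy = shannon_entropy(scores)                        # spread of the scores
penalty = elastic_net_penalty(context_vector, l1=0.1, l2=0.01)
```

A low entropy means the attention mass sits on a few SNPs, while the maximum, log(n) for n SNPs, means every SNP is weighted equally. Adding an entropy term to the training loss is one way to steer the scores between these extremes.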
Tensor size for each layer in the NucleoNets. The symbols in this table denote the batch size, the length of SNP, the embedding size, the number of attention hidden layers, the number of sample locations, the number of sample varieties, the MLP hidden layer of the deep model, and the MLP hidden layer of the wide model. FC means the Fully Connected layer.
| Deep model | Size | Wide model | Size | Wide deep model | Size |
|---|---|---|---|---|---|
| SNP data input | | Sample location data input | | Concat | |
| SNP data embedding | | Sample location one hot encoding | | FC3 | |
| SNP position input | | Sample location embedding | | Output | |
| SNP position embedding | | Sample location flatten | | | |
| SNP data + position | | Wide model 1 | | | |
| Attention layer 1 | | Sample variety data input | | | |
| Attention layer 2 | | Sample variety one hot encoding | | | |
| Context vector | | Sample variety embedding | | | |
| Concatenation (GAP) | | Sample variety flatten | | | |
| FC1 | | Wide model 2 | | | |
| FC2 | | | | | |
NucleoNets model comparison with other models. ✓: the related part is available in the model. ✖: the related part is unavailable in the model. *Not mentioned in the original paper[6]. **The Scikit-learn library does not support p-value calculation; conversely, the Statsmodels library does not have an ENET function. ***NucleoNets results from ABST-6.
| Polygenic model | GGDPR | OLS | OLS + ENET | NucleoNetV1 | NucleoNetV2 | NucleoNetV3 | Wide and deep model |
|---|---|---|---|---|---|---|---|
| Total Indonesian rice SNPs | 1232 | 1232 | 1232 | 1232 | 1232 | 1232 | 1232 |
| SNP data | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SNP position data | ✖ | ✖ | ✖ | ✖ | ✖ | ✖ | ✓ |
| Covariate: sample location | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Covariate: sample variety | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Shrinkage prior/regularization | Generalized double pareto | ✖ | ENET | ✖ | Modified ENET | Modified ENET | Modified ENET |
| Shannon’s entropy | ✖ | ✖ | ✖ | ✖ | ✖ | ✓ | ✓ |
| Evaluation: MSE | N/A* | 4.104 | 2.517 | 2.779*** | 2.799*** | 2.863*** | 8.535 |
| Evaluation: RMSE | N/A* | 2.026 | 1.587 | 1.667 | 1.673 | 1.692 | 2.921 |
| Evaluation: MBE | N/A* | −0.236 | −0.404 | 0.099 | 0.015 | −0.074 | −2.148 |
| Evaluation: MAE | N/A* | 1.673 | 1.321 | 1.407 | 1.412 | 1.433 | 2.497 |
| Evaluation: MSLE | N/A* | 0.286 | 0.185 | 0.184 | 0.191 | 0.197 | 0.468 |
| Evaluation: SMAPE | N/A* | 64.843% | 45.432% | 47.156% | 47.960% | 47.481% | 63.668% |
| Significance/importance level | N/A* | | N/A** | | | | N/A |
| Number of significant/important SNP | 9 | 16 | N/A** | 29 | 35 | 23 | N/A |
| Execution time | N/A* | < 2 s | < 2 s | 1630 s | 5120 s | 4910 s | 6070 s |
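The evaluation rows in the table above can be cross-checked with a short sketch. The metric definitions below are the common textbook forms and are assumptions here: the source does not state which SMAPE variant or which sign convention for MBE was used. The reported MSE and RMSE values are copied from the table.

```python
import math

def mse(y_true, y_pred):
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mbe(y_true, y_pred):
    # Mean bias error; "prediction minus truth" is an assumed sign convention.
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def msle(y_true, y_pred):
    # Mean squared logarithmic error; inputs must be greater than -1.
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

def smape(y_true, y_pred):
    # One common SMAPE variant (in percent); pairs that are both zero add 0.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        if denom > 0:
            total += abs(p - t) / denom
    return 100.0 * total / len(y_true)

# (MSE, RMSE) pairs as reported in the comparison table.
reported = {
    "OLS": (4.104, 2.026),
    "OLS + ENET": (2.517, 1.587),
    "NucleoNetV1": (2.779, 1.667),
    "NucleoNetV2": (2.799, 1.673),
    "NucleoNetV3": (2.863, 1.692),
    "Wide and deep model": (8.535, 2.921),
}

# RMSE should equal sqrt(MSE) up to rounding of the printed values.
rmse_gap = {name: abs(math.sqrt(m) - r) for name, (m, r) in reported.items()}

# Relative MSE reduction of NucleoNetV1 over OLS, in percent.
mse_reduction_pct = 100 * (reported["OLS"][0] - reported["NucleoNetV1"][0]) / reported["OLS"][0]
```

With the rounded table values, the reduction comes out at roughly 32.3%, which agrees with the 32.28% quoted in the abstract up to rounding.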
Figure 4. Ablation study results testing for one random sample.
Figure 5. The NucleoNets training plots.
Figure 6. NucleoNetV3 testing results under different seeds.
Figure 7. Important SNPs emitted per attention score.
Important SNPs found in the NucleoNets. Chr:Pos means Chromosome:Position. The suffix of each SNP name denotes its alternate allele. *Intronic. **Intergenic.
| Model | SNP name | Chr:Pos | NucleoNets | | Marginal regression | | Full regression | |
|---|---|---|---|---|---|---|---|---|
| | | | Count | | | | | |
| NucleoNetV1 | TBGI336584_T* | 7:28,902,549 | 104 | 0.349702 | 0.692976 | 0.367086 | 0.613405 | −0.00464 |
| | TBGI139174_C* | 3:10,546,292 | 100 | 0.078781 | 0.501128 | 0.118872 | 0.258786 | −0.05250 |
| | TBGI043687_A* | 1:27,033,613 | 98 | 0.039402 | 0.461979 | 0.092519 | 0.749955 | 0.018242 |
| | TBGI047097_A* | 1:29,101,182 | 87 | 0.043968 | 0.245114 | 0.146880 | 0.731616 | −0.00822 |
| | id2008820_T* | 2:23,034,401 | 48 | 0.028928 | 0.293053 | 0.133663 | 0.487864 | −0.15724 |
| NucleoNetV2 | id4010708_C | 4:31,871,929 | 76 | 0.334360 | 0.023139 | 0.178155 | 0.181538 | 0.092289 |
| | TBGI133654_T* | 3:6,221,117 | 71 | 0.073753 | 0.981030 | −0.00224 | 0.051080 | −0.11139 |
| | TBGI133263_A** | 3:5,884,040 | 64 | 0.057674 | 0.554272 | 0.060059 | 0.616267 | 0.035691 |
| | id1010403_T* | 1:16,716,706 | 53 | 0.040871 | 0.275980 | 0.377068 | 0.725071 | 0.007040 |
| | TBGI272488_T* | 6:3,001,902 | 34 | 0.363929 | 0.451712 | 0.057712 | 0.725524 | 0.014053 |
| NucleoNetV3 | id10004275_C | 10:16,252,942 | 102 | 0.050838 | 0.523674 | −0.37561 | 0.373641 | 0.050556 |
| | TBGI264076_A* | 5:27,953,016 | 91 | 0.125639 | 0.90349 | 0.018688 | 0.611320 | −0.01367 |
| | TBGI130922_G** | 3:4,441,747 | 75 | 0.032907 | 0.356457 | −0.07551 | 0.933317 | −0.00536 |
| | TBGI038001_C* | 1:23,689,014 | 73 | 0.133440 | 0.564393 | −0.04618 | 0.195798 | −0.06157 |
| | TBGI336599_C* | 7:28,905,733 | 73 | 0.043163 | 0.930258 | −0.00685 | 0.535020 | −0.03080 |
The Null Hypothesis Significance Testing (NHST) results.
| Main model | Comparison model | t-test | Validation | Conclusion | Description |
|---|---|---|---|---|---|
| NucleoNetV1 | OLS | Two-tailed | | Reject | Proceed to a one-tailed t-test |
| | | 1. \|t-stat\| > t-table | Is \|−2.998\| > 2.026? | TRUE | |
| | | 2. | Is 0.003 < 0.025? | TRUE | |
| | | One-tailed (less than) | | Reject | In predicting Indonesian rice yields, the NucleoNetV1 model outperformed the OLS model |
| | | 1. t-stat < t-table | Is −2.998 < −1.687? | TRUE | |
| | | 2. | Is 0.002 < 0.05? | TRUE | |
| | | One-tailed (greater than) | | Reject | |
| | | 1. t-stat > t-table | Is −2.998 > 1.687? | FALSE | |
| | | 2. | Is 0.998 < 0.05? | FALSE | |
| | OLS + ENET | Two-tailed | | Reject | In predicting Indonesian rice yields, the NucleoNetV1 model did not differ from the OLS + ENET model |
| | | 1. \|t-stat\| > t-table | Is \|−1.028\| > 2.026? | FALSE | |
| | | 2. | Is 0.311 < 0.025? | FALSE | |
| | | One-tailed (less than) | | – | |
| | | – | – | – | |
| | | One-tailed (greater than) | | – | |
| | | – | – | – | |
| NucleoNetV2 | OLS | Two-tailed | | Reject | Proceed to a one-tailed t-test |
| | | 1. \|t-stat\| > t-table | Is \|−2.753\| > 2.026? | TRUE | |
| | | 2. | Is 0.091 < 0.025? | FALSE | |
| | | One-tailed (less than) | | Reject | In predicting Indonesian rice yields, the NucleoNetV2 model outperformed the OLS model |
| | | 1. t-stat < t-table | Is −2.753 < −1.687? | TRUE | |
| | | 2. | Is 0.005 < 0.05? | TRUE | |
| | | One-tailed (greater than) | | Reject | |
| | | 1. t-stat > t-table | Is −2.753 > 1.687? | FALSE | |
| | | 2. | Is 0.995 < 0.05? | FALSE | |
| | OLS + ENET | Two-tailed | | Reject | In predicting Indonesian rice yields, the NucleoNetV2 model did not differ from the OLS + ENET model |
| | | 1. \|t-stat\| > t-table | Is \|−1.027\| > 2.026? | FALSE | |
| | | 2. | Is 0.311 < 0.025? | FALSE | |
| | | One-tailed (less than) | | – | |
| | | – | – | – | |
| | | One-tailed (greater than) | | – | |
| | | – | – | – | |
| NucleoNetV3 | OLS | Two-tailed | | Reject | Proceed to a one-tailed t-test |
| | | 1. \|t-stat\| > t-table | Is \|−2.937\| > 2.026? | TRUE | |
| | | 2. | Is 0.006 < 0.025? | TRUE | |
| | | One-tailed (less than) | | Reject | In predicting Indonesian rice yields, the NucleoNetV3 model outperformed the OLS model |
| | | 1. t-stat < t-table | Is −2.937 < −1.687? | TRUE | |
| | | 2. | Is 0.003 < 0.05? | TRUE | |
| | | One-tailed (greater than) | | Reject | |
| | | 1. t-stat > t-table | Is −2.937 > 1.687? | FALSE | |
| | | 2. | Is 0.997 < 0.05? | FALSE | |
| | OLS + ENET | Two-tailed | | Reject | In predicting Indonesian rice yields, the NucleoNetV3 model did not differ from the OLS + ENET model |
| | | 1. \|t-stat\| > t-table | Is \|−0.743\| > 2.026? | FALSE | |
| | | 2. | Is 0.462 < 0.025? | FALSE | |
| | | One-tailed (less than) | | – | |
| | | – | – | – | |
| | | One-tailed (greater than) | | – | |
| | | – | – | – | |
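The decision procedure in the table above (a two-tailed test first, then one-tailed tests only if the two-tailed test rejects) can be sketched as follows. The critical values 2.026 and 1.687 and the t-statistics are copied from the table. The function names, the returned labels, and the reading that a negative t-statistic means the main model has the lower error are assumptions for illustration.

```python
# Critical values as listed in the table (two-tailed and one-tailed).
T_CRIT_TWO_TAILED = 2.026
T_CRIT_ONE_TAILED = 1.687

def two_tailed_rejects(t_stat, t_crit=T_CRIT_TWO_TAILED):
    """Criterion 1 of the two-tailed test: |t-stat| > t-table."""
    return abs(t_stat) > t_crit

def less_than_rejects(t_stat, t_crit=T_CRIT_ONE_TAILED):
    """Criterion 1 of the one-tailed (less than) test: t-stat < -t-table."""
    return t_stat < -t_crit

def greater_than_rejects(t_stat, t_crit=T_CRIT_ONE_TAILED):
    """Criterion 1 of the one-tailed (greater than) test: t-stat > t-table."""
    return t_stat > t_crit

def compare(t_stat):
    """Hypothetical helper that walks through the table's decision sequence."""
    if not two_tailed_rejects(t_stat):
        return "no difference"
    if less_than_rejects(t_stat):
        return "main model better"   # assumed: lower error for the main model
    if greater_than_rejects(t_stat):
        return "main model worse"
    return "inconclusive"

# t-statistics reported in the table, keyed by (main model, comparison model).
t_stats = {
    ("NucleoNetV1", "OLS"): -2.998,
    ("NucleoNetV1", "OLS + ENET"): -1.028,
    ("NucleoNetV2", "OLS"): -2.753,
    ("NucleoNetV2", "OLS + ENET"): -1.027,
    ("NucleoNetV3", "OLS"): -2.937,
    ("NucleoNetV3", "OLS + ENET"): -0.743,
}

decisions = {pair: compare(t) for pair, t in t_stats.items()}
```

This sketch applies only the t-statistic criterion. The table also lists a second, p-value-based criterion per test, which would need the test's degrees of freedom to reproduce.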