Felix Heinrich, Martin Wutke, Pronaya Prosun Das, Miriam Kamp, Mehmet Gültas, Wolfgang Link, Armin Otto Schmitt.
Abstract
Faba bean (Vicia faba) is a grain legume that is grown globally both for human consumption and as livestock feed. Despite its agro-ecological importance, the use of Vicia faba is severely hampered by its anti-nutritive seed compounds vicine and convicine (V+C). The genes responsible for a low V+C content have not yet been identified. In this study, we aim to computationally identify regulatory SNPs (rSNPs), i.e., SNPs in promoter regions of genes that are deemed to govern the V+C content of Vicia faba. For this purpose, we first trained a deep learning model with the gene annotations of seven related species of the Leguminosae family. Applying our model, we predicted putative promoters in a partial genome of Vicia faba that we assembled from genotyping-by-sequencing (GBS) data. Exploiting the synteny between Medicago truncatula and Vicia faba, we identified two rSNPs that are statistically significantly associated with V+C content. In particular, the allele substitutions at these rSNPs result in dramatic changes of the binding sites of the transcription factors (TFs) MYB4, MYB61, and SQUA. Knowledge of these TFs and their rSNPs may enhance our understanding of the regulatory programs controlling the V+C content of Vicia faba and could provide new hypotheses for future breeding programs.
Keywords: GBS; Vicia faba; convolutional neural network; promoter; rSNP; vicin/convicin
Year: 2020 PMID: 32516876 PMCID: PMC7349281 DOI: 10.3390/genes11060614
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Number of promoter and non-promoter sequences in the sets that were used as training sets.
| Species | # Promoter Sequences | # Non-Promoter Sequences |
|---|---|---|
| | 23,315 | 23,315 |
| | 46,199 | 46,199 |
| | 23,463 | 23,463 |
| | 32,158 | 32,158 |
| | 22,750 | 22,750 |
| | 14,749 | 14,749 |
| | 19,584 | 19,584 |
| | 15,495 | 15,495 |
| - | ∼12,000 | |
| - | ∼60,000 |
Figure 1. The network architecture of the CNN promoter prediction consists of four 1D-convolutional layers followed by a flattening layer and two fully-connected layers. At the end, an output layer with one neuron and a sigmoid activation function computes the probability that the analyzed sequence is classified as a promoter sequence.
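To make the layer sequence in the caption concrete, the following minimal sketch traces tensor shapes through four 1D-convolutional layers, a flattening layer, two fully-connected layers, and a single sigmoid output. The input length, filter counts, kernel sizes, and dense-layer widths are hypothetical placeholders, since the caption does not state them; only the order of the layers follows the figure.

```python
import math


def conv1d_output_length(length, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution without dilation."""
    return (length + 2 * padding - kernel_size) // stride + 1


def sigmoid(x):
    """Logistic function used by the single output neuron."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical hyperparameters: none of these values are given in the caption.
SEQ_LEN = 300                                         # assumed input length in bp
CONV_LAYERS = [(32, 7), (64, 5), (64, 5), (128, 3)]   # (filters, kernel size)
DENSE_UNITS = [128, 64]                               # two fully-connected layers


def trace_shapes(seq_len=SEQ_LEN):
    """Return (layer name, output shape) pairs for the assumed architecture."""
    shapes = [("input (one-hot)", (seq_len, 4))]      # A, C, G, T channels
    length, channels = seq_len, 4
    for i, (filters, kernel) in enumerate(CONV_LAYERS, start=1):
        length = conv1d_output_length(length, kernel)
        channels = filters
        shapes.append((f"conv1d_{i}", (length, channels)))
    shapes.append(("flatten", (length * channels,)))
    for i, units in enumerate(DENSE_UNITS, start=1):
        shapes.append((f"dense_{i}", (units,)))
    shapes.append(("output (sigmoid)", (1,)))
    return shapes


for name, shape in trace_shapes():
    print(f"{name:18s} {shape}")
```

The final value passed through `sigmoid` lies between 0 and 1 and can be read as the predicted probability that the input sequence is a promoter.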
ACC values of the intra- and inter-species promoter classification using the species-specific trained CNNs. Off-diagonal numbers are ACC values for inter-species classification, diagonal numbers are ACC values for intra-species classification. For instance, a CNN trained on Lupinus angustifolius and used for classification of Vigna angularis promoters has an accuracy of 0.974.
| Trained (rows) / Evaluated (columns) | | | Lupinus angustifolius | | | | Vigna angularis | |
|---|---|---|---|---|---|---|---|---|
| | 0.901 | 0.767 | 0.690 | 0.746 | 0.797 | 0.765 | 0.633 | 0.733 |
| | 0.837 | 0.864 | 0.915 | 0.847 | 0.863 | 0.724 | 0.914 | 0.856 |
| Lupinus angustifolius | 0.545 | 0.611 | 0.981 | 0.720 | 0.586 | 0.493 | 0.974 | 0.709 |
| | 0.755 | 0.797 | 0.959 | 0.876 | 0.789 | 0.715 | 0.951 | 0.841 |
| | 0.845 | 0.842 | 0.888 | 0.834 | 0.898 | 0.748 | 0.880 | 0.853 |
| | 0.822 | 0.764 | 0.696 | 0.751 | 0.794 | 0.840 | 0.689 | 0.736 |
| Vigna angularis | 0.510 | 0.607 | 0.971 | 0.715 | 0.583 | 0.494 | 0.977 | 0.712 |
| | 0.741 | 0.812 | 0.937 | 0.827 | 0.825 | 0.675 | 0.928 | 0.904 |
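One way to summarize the matrix above is to average each row's off-diagonal entries, which gives the mean inter-species accuracy of each species-specific CNN. The sketch below does this for the eight rows in the order listed; the function name is illustrative, and rows are referred to by position only.

```python
# ACC values copied from the table above; rows = training species,
# columns = evaluated species, both in the order listed.
ACC = [
    [0.901, 0.767, 0.690, 0.746, 0.797, 0.765, 0.633, 0.733],
    [0.837, 0.864, 0.915, 0.847, 0.863, 0.724, 0.914, 0.856],
    [0.545, 0.611, 0.981, 0.720, 0.586, 0.493, 0.974, 0.709],
    [0.755, 0.797, 0.959, 0.876, 0.789, 0.715, 0.951, 0.841],
    [0.845, 0.842, 0.888, 0.834, 0.898, 0.748, 0.880, 0.853],
    [0.822, 0.764, 0.696, 0.751, 0.794, 0.840, 0.689, 0.736],
    [0.510, 0.607, 0.971, 0.715, 0.583, 0.494, 0.977, 0.712],
    [0.741, 0.812, 0.937, 0.827, 0.825, 0.675, 0.928, 0.904],
]


def mean_inter_species_accuracy(acc):
    """Average the off-diagonal entries of each row (diagonal = intra-species)."""
    n = len(acc)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(acc)
    ]


means = mean_inter_species_accuracy(ACC)
best_row = max(range(len(means)), key=means.__getitem__)

for i, (row, mean) in enumerate(zip(ACC, means), start=1):
    print(f"row {i}: intra = {row[i - 1]:.3f}, mean inter = {mean:.3f}")
print(f"best cross-species generalization: row {best_row + 1}")
```

With these values, the CNN in the second row has the highest mean inter-species accuracy (about 0.851), although its intra-species accuracy (0.864) is lower than that of several other rows.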
Contribution of additional features in the CNN model of Medicago truncatula.
| Features | Accuracy | Sensitivity | Specificity | MCC |
|---|---|---|---|---|
| DNA sequence | 0.876 | 0.897 | 0.855 | 0.750 |
| DNA sequence + 2-mer | 0.874 | 0.880 | 0.867 | 0.747 |
| DNA sequence + 2-mer + frequency of CA motif | 0.862 | 0.828 | 0.897 | 0.726 |
| DNA sequence + 2-mer + frequency of CG motif | 0.875 | 0.875 | 0.876 | 0.751 |
| DNA sequence + 2-mer + HMI | 0.874 | 0.882 | 0.865 | 0.747 |
| DNA sequence + 2-mer + frequency of TATAA motif | 0.876 | 0.875 | 0.878 | 0.752 |
| DNA sequence + 2-mer + CG skew | 0.876 | 0.889 | 0.863 | 0.753 |
| DNA sequence + topological entropy | 0.874 | 0.886 | 0.861 | 0.747 |
| DNA sequence + 2-mer + topological entropy | 0.871 | 0.852 | 0.890 | 0.743 |
| DNA sequence + 2-mer + HMI + | 0.871 | 0.869 | 0.874 | 0.743 |
| DNA sequence + 2-mer + HMI + | 0.873 | 0.859 | 0.888 | 0.747 |
| DNA sequence + HMI + | 0.875 | 0.889 | 0.860 | 0.749 |
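The four columns of the table above are standard binary-classification metrics. As a reminder of how they relate to a confusion matrix, the sketch below computes accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC) from counts of true and false positives and negatives. The counts in the usage line are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # true positive rate (promoters found)
    specificity = tn / (tn + fp)          # true negative rate (non-promoters rejected)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced test set of 100 promoters and 100 non-promoters.
metrics = classification_metrics(tp=90, fp=15, tn=85, fn=10)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```

On a balanced test set, accuracy equals the mean of sensitivity and specificity, which is why a feature set can trade one for the other (as in several rows above) while accuracy barely moves.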
The 14 SNPs found in the predicted promoters of Vicia faba that were mapped to the Medicago truncatula target genomic region.
| SNP_ID | Genotype | FDR | Position | Medicago Gene |
|---|---|---|---|---|
| SNP_131938_118 | C/T | 0.234 | 1,385,390 | MTR_2g008290 |
| SNP_302904_183 | G/A | 0.179 | 1,385,444 | |
| SNP_341016_236 | C/T | | 1,554,857 | MTR_2g008620 |
| SNP_341016_239 | G/A | | 1,554,860 | |
| SNP_356745_200 | A/G | 0.730 | 1,707,078 | MTR_2g008960 |
| SNP_280549_41 | C/T | 0.234 | 1,707,183 | |
| SNP_350273_103 | G/T | 0.234 | 1,707,199 | |
| SNP_350273_90 | A/C | 0.234 | 1,707,212 | |
| SNP_350273_61 | G/A | 0.234 | 1,912,704 | MTR_2g009430 |
| SNP_29452_204 | G/A | 0.730 | 1,912,812 | |
| SNP_29452_206 | G/A | 0.496 | 1,912,814 | |
| SNP_118828_190 | C/T | 0.234 | 2,030,017 | MTR_2g009690 |
| SNP_80231_27 | C/T | 0.234 | 2,163,048 | MTR_2g009940 |
| SNP_364434_97 | A/T | 0.359 | 2,163,084 | |
Genotype refers to the reference and alternative alleles; FDR is the false discovery rate obtained in an association test with the V+C content of 20 Vicia faba lines; Position is the position in bp on the Medicago truncatula chromosome 2.
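The footnote does not say which procedure produced the FDR values. A common choice for controlling the false discovery rate across multiple SNP tests is the Benjamini-Hochberg step-up adjustment, sketched below purely as an assumption about how such values could be derived from raw p-values. The p-values in the usage example are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: shown only as one common FDR procedure; the source does
    not state which method was used.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four SNPs.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:.3f} -> adjusted = {q:.3f}")
```

Because of the monotonicity step, several tests often share the same adjusted value, so repeated FDR entries in a table of this kind are not unusual under such a procedure.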
The two SNPs with the strongest association to the V+C content and their consequences. The column Allele indicates for which allele of the SNP the binding site was found. TFBS refers to the name of the binding sites, which were named after their PWMs. The structure of the PWM names is given as: P$TFname_version, where “P$” stands for the PWMs used for the prediction of the TFBSs of plant TFs. “TFname” refers to the name of the transcription factor, and “_version” refers to the version of the PWM.
| SNP_ID | Allele | TFBS | MSS | Consequence |
|---|---|---|---|---|
| SNP_341016_236 | Ref | P$MYB4_01 | 0.945 | Loss of TFBS |
| SNP_341016_236 | Ref | P$MYB61_01 | 0.880 | Loss of TFBS |
| SNP_341016_236 | Alt | P$SQUA_01 | 0.870 | Gain of TFBS |
| SNP_341016_239 | Ref | P$MYB61_01 | 0.880 | Score change |
| SNP_341016_239 | Alt | P$MYB61_01 | 0.881 | Score change |
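The Consequence column follows from comparing binding-site scores for the reference and alternative alleles. The sketch below illustrates that logic with a simplified similarity score, normalized between the worst and best possible matrix score. This is an assumption for illustration, not necessarily the MSS definition used in the study. The toy matrix, the example sites, and the threshold of 0.85 are hypothetical.

```python
# Hypothetical 4-position frequency matrix; not a real plant TF matrix.
TOY_PWM = [
    {"A": 0.80, "C": 0.10, "G": 0.05, "T": 0.05},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.40, "C": 0.05, "G": 0.05, "T": 0.50},
]


def similarity_score(pwm, site):
    """Simplified score in [0, 1]: (raw - min) / (max - min) over the matrix."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the matrix length")
    raw = sum(column[base] for column, base in zip(pwm, site))
    lowest = sum(min(column.values()) for column in pwm)
    highest = sum(max(column.values()) for column in pwm)
    return (raw - lowest) / (highest - lowest)


def classify_consequence(pwm, ref_site, alt_site, threshold=0.85):
    """Label an allele substitution by its effect on a predicted binding site."""
    ref = similarity_score(pwm, ref_site)
    alt = similarity_score(pwm, alt_site)
    ref_hit, alt_hit = ref >= threshold, alt >= threshold
    if ref_hit and not alt_hit:
        return "Loss of TFBS"
    if alt_hit and not ref_hit:
        return "Gain of TFBS"
    if ref_hit and alt_hit and ref != alt:
        return "Score change"
    return "No change"


print(classify_consequence(TOY_PWM, "ACCT", "ACTT"))  # site present only for ref
print(classify_consequence(TOY_PWM, "ACTT", "ACCT"))  # site present only for alt
print(classify_consequence(TOY_PWM, "ACCT", "ACCA"))  # site present for both
```

Under this scheme, a site predicted only for the reference allele is reported as a loss, one predicted only for the alternative allele as a gain, and one predicted for both with different scores as a score change, mirroring the three labels in the table.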