| Literature DB >> 35513802 |
Xiaodan Zhang1,2, Jinxiang Xuan1,2, Chensong Yao3, Qijuan Gao1, Lianglong Wang1,2, Xiu Jin4,5, Shaowen Li6,7.
Abstract
BACKGROUND: Orphan genes play an important role in the environmental stress responses of many species, and their identification is a critical step toward understanding biological functions. Moso bamboo has high ecological, economic, and cultural value. Studies have shown that the growth of moso bamboo is influenced by various stresses. Traditional identification methods are time-consuming and inefficient. Hence, the development of efficient, high-accuracy computational methods for predicting orphan genes is of great significance.
Keywords: Convolutional neural network; Deep learning; Moso bamboo; Orphan genes; Transformer neural network
Year: 2022 PMID: 35513802 PMCID: PMC9069780 DOI: 10.1186/s12859-022-04702-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Flow chart of data acquisition for moso bamboo OGs
Amino acid embedding cross-reference table
| Amino acids | Letters | Code | Amino acids | Letters | Code |
|---|---|---|---|---|---|
| Histidine | H | 1 | Methionine | M | 2 |
| Alanine | A | 3 | Lysine | K | 4 |
| Cysteine | C | 5 | Arginine | R | 6 |
| Leucine | L | 7 | Tyrosine | Y | 8 |
| Serine | S | 9 | Aspartic acid | D | 10 |
| Glycine | G | 11 | Valine | V | 12 |
| Isoleucine | I | 13 | Glutamic acid | E | 14 |
| Asparagine | N | 15 | Tryptophan | W | 16 |
| Phenylalanine | F | 17 | Threonine | T | 18 |
| Glutamine | Q | 19 | Proline | P | 20 |
| Illegal amino acids | B, J, O, U, X, Z | 21 | | | |
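The cross-reference table above can be applied directly as an integer encoding step. The sketch below is a minimal illustration, assuming right-padding with 0 and truncation to a fixed length (the padding value and truncation behaviour are assumptions, not stated in the table itself).

```python
# Integer encoding of amino acid letters per the cross-reference table above.
AA_CODE = {
    "H": 1, "M": 2, "A": 3, "K": 4, "C": 5, "R": 6, "L": 7, "Y": 8,
    "S": 9, "D": 10, "G": 11, "V": 12, "I": 13, "E": 14, "N": 15,
    "W": 16, "F": 17, "T": 18, "Q": 19, "P": 20,
}
ILLEGAL_CODE = 21  # B, J, O, U, X, Z all map to a single "illegal" code

def encode_sequence(seq: str, max_len: int = 50) -> list:
    """Map a protein sequence to a fixed-length list of integer codes."""
    codes = [AA_CODE.get(aa, ILLEGAL_CODE) for aa in seq.upper()[:max_len]]
    return codes + [0] * (max_len - len(codes))  # right-pad with 0

print(encode_sequence("MHAB", max_len=6))  # → [2, 1, 3, 21, 0, 0]
```

The resulting integer vectors are what the embedding layer (Fig. 3) consumes; each code is looked up as a dense vector of size Embedding_dim.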
Fig. 2 The structure of the encoder and decoder in the transformer model
Fig. 3 CNN + Transformer model structure. The discrete raw sequence is transformed into a dense, continuous vector Fe through feature embedding and then fed into a CNN layer with multi-scale convolution kernels to capture local amino acid k-mer features. The extracted feature map of the CNN layer is passed to the Transformer neural network, whose multi-head self-attention mechanism captures long-range interactions between k-mers. Finally, the Transformer outputs are passed to the fully connected layers to produce the identification result
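The multi-scale convolution stage described in the caption can be sketched as follows. This is a toy shape-level illustration only: the filter count, random (untrained) weights, and "same" padding are assumptions standing in for the paper's trained configuration.

```python
import numpy as np

def multiscale_conv1d(x, kernel_sizes=(2, 3, 4, 5, 6, 7), filters=8, seed=0):
    """Toy multi-scale 1D convolution over an embedded sequence.

    x: (seq_len, embed_dim) embedded sequence. For each kernel size k, a random
    filter bank slides over the sequence ("same" padding), producing k-mer
    features; the per-scale maps are concatenated along the channel axis,
    giving the feature map handed to the Transformer.
    """
    rng = np.random.default_rng(seed)
    seq_len, dim = x.shape
    maps = []
    for k in kernel_sizes:
        w = rng.standard_normal((filters, k, dim))          # filter bank for k-mers
        pad = np.pad(x, ((k // 2, k - 1 - k // 2), (0, 0)))  # "same" padding
        out = np.stack([
            np.einsum("fkd,kd->f", w, pad[i:i + k]) for i in range(seq_len)
        ])                                                   # (seq_len, filters)
        maps.append(np.maximum(out, 0))                      # ReLU activation
    return np.concatenate(maps, axis=1)  # (seq_len, filters * len(kernel_sizes))
```

With the six kernel sizes of the paper's CNN_6 layer, a (seq_len, embed_dim) input yields a (seq_len, 6 × filters) feature map, each channel group summarising k-mers of one length.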
Division and construction of datasets
| Data | OG | Non-OG | Total |
|---|---|---|---|
| Original set | 1544 | 30,443 | 31,987 |
| Training set | 1071 | 21,317 | 22,388 |
| Validation set | 235 | 4566 | 4801 |
| Testing set | 238 | 4560 | 4798 |
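The counts in the table above can be checked directly: the training, validation, and testing sets partition the original set in roughly a 70/15/15 ratio, and the OG : non-OG class imbalance (about 1 : 20) is preserved in each split.

```python
# Sanity check on the dataset split reported in the table above.
splits = {
    "original":   (1544, 30443),
    "training":   (1071, 21317),
    "validation": (235, 4566),
    "testing":    (238, 4560),
}
total = sum(splits["original"])
for name, (og, non_og) in splits.items():
    n = og + non_og
    print(f"{name:>10}: n={n:>6}  share={n / total:.3f}  OG fraction={og / n:.4f}")
```

Running this confirms the three subsets sum exactly to the 31,987 sequences of the original set, with an OG fraction near 0.048 throughout.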
Fig. 4 Sequence length distribution in original set, training set, validation set, and testing set
The average BA and GM values under different CNN + Transformer structures
| Method | BA | GM | Train time (min) | Test time (s) |
|---|---|---|---|---|
| E + FC_256 | 0.677 | 0.612 | 25 | 88 |
| E + CNN_6 + FC_256 | 0.748 | 0.644 | 29 | 106 |
| E + CNN_6 + CNN_3 + FC_256 | 0.773 | 0.784 | 28 | 99 |
| E + Transformer + FC_256 | 0.844 | 0.832 | 57 | 390 |
| E + CNN_6 + Transformer + FC_256 | 0.866 | 0.849 | 61 | 377 |
| E + CNN_6 + CNN_3 + Transformer + FC_256 | | | 44 | 342 |
| E + CNN_6 + CNN_3 + CNN_3 + Transformer + FC_256 | 0.853 | 0.837 | 48 | 366 |
| E + CNN_6 + CNN_3 + Transformer + FC_256 + FC_256 | 0.871 | 0.865 | 55 | 404 |
Bold values indicate the highest values of different evaluation indicators
E: word embedding coding. CNN_6: multiscale convolution layer with kernel sizes {2, 3, 4, 5, 6, 7}, one per scale. CNN_3: multiscale convolution layer with kernel sizes {3, 6, 9}, one per scale. Transformer: three-layer Transformer neural network. FC_256: fully connected layer with 256 neurons
Fig. 5 Comparison of CNN + Transformer performance with different Max_len values
Fig. 6 Comparison of CNN + Transformer performance with different Embedding_dim values
Fig. 7 Performance comparison of CNN + Transformer under different N-head and Num-layer combinations. L: number of Transformer encoder–decoder layers; H: number of heads in the self-attention mechanism
Model performance comparison
| Model | BA | GM | BM | MCC | Train time (min) | Test time (s) |
|---|---|---|---|---|---|---|
| Random forest | 0.667 | 0.629 | 0.334 | 0.227 | 4 | 22 |
| SVM | 0.690 | 0.659 | 0.380 | 0.252 | 23 | 77 |
| RNN | 0.517 | 0.512 | 0.034 | 0.245 | 31 | 123 |
| CNN + RNN | 0.503 | 0.500 | 0.007 | 0.109 | 18 | 98 |
| LSTM | 0.829 | 0.829 | 0.659 | 0.418 | 36 | 284 |
| CNN + LSTM | 0.775 | 0.772 | 0.550 | 0.376 | 26 | 231 |
| GRU | 0.838 | 0.834 | 0.667 | 0.423 | 33 | 253 |
| CNN + GRU | 0.777 | 0.776 | 0.554 | 0.373 | 24 | 219 |
| Transformer | 0.844 | 0.838 | 0.678 | 0.444 | 57 | 387 |
| CNN + Transformer | | | | | 44 | 343 |
Bold values indicate the highest values of different evaluation indicators
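The evaluation metrics used in the comparison tables (BA, GM, BM, MCC) all derive from the confusion matrix. A minimal sketch, using standard definitions; the example confusion-matrix counts are illustrative, not taken from the paper:

```python
import math

def metrics(tp, fn, fp, tn):
    """Table metrics from a binary confusion matrix.

    BA  = (sensitivity + specificity) / 2   (balanced accuracy)
    GM  = sqrt(sensitivity * specificity)   (geometric mean)
    BM  = sensitivity + specificity - 1     (bookmaker informedness)
    MCC = Matthews correlation coefficient
    """
    sens = tp / (tp + fn)  # true positive rate on the minority OG class
    spec = tn / (tn + fp)  # true negative rate on the non-OG class
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "BA": (sens + spec) / 2,
        "GM": math.sqrt(sens * spec),
        "BM": sens + spec - 1,
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

# Illustrative counts only (238 OGs / 4560 non-OGs, matching the testing-set sizes):
print(metrics(tp=180, fn=58, fp=600, tn=3960))
```

Because the dataset is heavily imbalanced (about 1 : 20), plain accuracy would be misleading; all four metrics weight sensitivity and specificity symmetrically, which is why they are the ones tabulated.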
Fig. 8 Functional classification and enrichment of OGs (Dr.Tom, BGI, China). A GO (Gene Ontology) classification of OGs. The vertical axis represents the GO terms, and the horizontal axis represents the number of OGs. B Bubble graph for GO enrichment (a larger bubble indicates more enriched genes, and a deeper blue indicates a more significant difference; q-value: the adjusted p-value). C KEGG (Kyoto Encyclopedia of Genes and Genomes) classification of OGs. D Bubble graph for KEGG enrichment