Manal Kalkatawi, Arturo Magana-Mora, Boris Jankovic, Vladimir B Bajic.
Abstract
MOTIVATION: Recognition of different genomic signals and regions (GSRs) in DNA is crucial for understanding genome organization, gene regulation, and gene function, which in turn leads to better genome and gene annotations. Although many methods have been developed to recognize GSRs, their purely computational identification remains challenging. Moreover, different GSRs usually require specialized sets of features for developing robust recognition models. Recently, deep-learning (DL) methods have been shown to generate more accurate prediction models than 'shallow' methods without the need to develop specialized features for the problems in question. Here, we explore the potential use of DL for the recognition of GSRs.
Year: 2019 PMID: 30184052 PMCID: PMC6449759 DOI: 10.1093/bioinformatics/bty752
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Hypothetical gene structure and surrounding control signals of eukaryotes.
Fig. 2. (A) Mononucleotide data representation. (B) Dinucleotide data representation. (C) Trinucleotide data representation. The best performance was achieved by using the trinucleotide representation.
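The k-mer representations named in Fig. 2 can be illustrated with a minimal sketch. It assumes overlapping windows with stride 1 and one-hot encoding over the alphabet ACGT; the function name `kmer_one_hot`, the column ordering, and the handling of ambiguous bases are illustrative choices, not taken from the paper. Under these assumptions a 600-nt sequence with k = 3 yields a 598 × 64 matrix, which is consistent with the input size given in the Fig. 3 caption.

```python
from itertools import product
from typing import List

ALPHABET = "ACGT"


def kmer_one_hot(seq: str, k: int = 3) -> List[List[int]]:
    """Encode a DNA sequence as a (len(seq) - k + 1) x 4**k one-hot matrix.

    k=1, 2, 3 give the mono-, di-, and trinucleotide representations.
    Assumption: overlapping windows with stride 1.
    """
    # Map every possible k-mer (AAA, AAC, ..., TTT for k=3) to a column index.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    seq = seq.upper()
    rows = []
    for start in range(len(seq) - k + 1):
        row = [0] * len(index)
        col = index.get(seq[start:start + k])
        # Windows containing a non-ACGT character (e.g. N) stay all-zero.
        if col is not None:
            row[col] = 1
        rows.append(row)
    return rows


window = "ACGT" * 150                    # a made-up 600-nt sequence
matrix = kmer_one_hot(window, k=3)
print(len(matrix), len(matrix[0]))       # 598 64
```

With `k=1` the same sequence gives a 600 × 4 matrix and with `k=2` a 599 × 16 matrix, so the trinucleotide form trades a slightly shorter first axis for a much wider, sparser second axis.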
DeepGSR parameters
| Layer | Parameter | Search space |
|---|---|---|
| Conv. layer 1 | Number of filters | [20, |
| | Filter length | [10, 20, |
| | Filter width | [2, 4, 8, 16, |
| | Initialization mode | [uniform, lecun_uniform, normal, |
| Conv. layer 2 | Number of filters | 100 |
| | Filter length | 10 |
| | Filter width | 8 |
| | Initialization mode | glorot_uniform |
| Fully connected layer | Activation function | [softmax, softplus, softsign, relu, |
| | Number of neurons | [32, 64, 128, |
| | Initialization mode | glorot_uniform |
| Learning | Learning batch size | [4, |
| | Optimizer | [SGD, RMSprop, Adagrad, |
| Regularization | Dropout expectation | [0.05, |
Note: A single dropout expectation value was tuned for both dropout layers.
Parameters in bold indicate the optimized values found by using a random search algorithm.
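The random search mentioned in the note can be sketched roughly as follows. This is an illustration of the general technique, not the authors' procedure: the search space below contains only the values that are legible in the table (several lists there are cut off, so it is incomplete), and `toy_score` is a hypothetical stand-in for training the network and returning a validation score.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]

# Partial search space: only the entries legible in the table above.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "filter_width": [2, 4, 8, 16],
    "init_mode": ["uniform", "lecun_uniform", "normal"],
    "dense_activation": ["softmax", "softplus", "softsign", "relu"],
    "dense_neurons": [32, 64, 128],
    "optimizer": ["SGD", "RMSprop", "Adagrad"],
}


def sample_config(space: Dict[str, List[Any]], rng: random.Random) -> Config:
    """Draw one value per hyperparameter, uniformly and independently."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(
    evaluate: Callable[[Config], float],
    space: Dict[str, List[Any]],
    n_trials: int = 20,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Return the best-scoring configuration among n_trials random draws."""
    rng = random.Random(seed)
    best_config: Config = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        config = sample_config(space, rng)
        score = evaluate(config)  # higher is better
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_score(config: Config) -> float:
    """Hypothetical objective; a real run would train and validate a model."""
    return -abs(config["dense_neurons"] - 64) - abs(config["filter_width"] - 8)


best, score = random_search(toy_score, SEARCH_SPACE, n_trials=50, seed=0)
print(best, score)
```

Unlike a grid search, the number of trials is fixed in advance and does not grow with the size of the search space, which is why random search is a common choice when many hyperparameters are tuned jointly.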
Fig. 3. The DeepGSR model architecture using a 2D-CNN. Each of the two convolutional layers uses a ReLU activation function and a max-pooling layer. The input layer is a matrix of size 598 × 64 based on the trinucleotide data representation.
Fig. 4. (A) Performance comparison of DeepGSR using 1D-CNN and 2D-CNN for the recognition of PAS and TIS signals in human genomic DNA. (B) Human_AATAAA_DeepGSR and Human_ATG_DeepGSR were used to test genomes of other organisms (cross-organism tests). For human data only, PAS_all represents all variants except AATAAA, plus only the testing portion of AATAAA (25%) that was not included in the training. (C) The results on PAS data using Human_pooled-PAS_DeepGSR for predicting PAS in other organisms. (D) The results for PAS and TIS data using organism-specific DeepGSR models.
Classification error reduction for PAS recognition in human
| Model | Published error rate (%) | Relative reduction in error rate (%) |
|---|---|---|
| DeepGSR | 13.06 | N/A |
| DPS | 16.49 | 20.80 |
| HMM-SVM | 18.59 | 29.75 |
| Omni-polyA | 14.02 | 6.85 |
| Average | – | 19.13 |
Note: Comparison between the state-of-the-art methods and DeepGSR on the problem of PAS (AATAAA) recognition on the human genome.
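The last column of the table above is consistent with the usual definition of a relative error reduction. The formula is not stated in the text here; it is inferred from the tabulated values, which it reproduces. Writing $e_{\text{other}}$ for the error rate of the compared method and $e_{\text{DeepGSR}}$ for that of DeepGSR (both symbols are introduced only for this illustration):

$$
\text{reduction} = \frac{e_{\text{other}} - e_{\text{DeepGSR}}}{e_{\text{other}}} \times 100\%
$$

For DPS, for example:

$$
\frac{16.49 - 13.06}{16.49} \times 100\% \approx 20.80\%
$$

The "Average" row is the arithmetic mean of the three reductions: $(20.80 + 29.75 + 6.85)/3 \approx 19.13$.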
Classification error reduction for TIS recognition in human
| Model | Error rate (%) | Relative reduction in error rate (%) |
|---|---|---|
| DeepGSR | 5.68 | N/A |
| TIS-ANN | 6.72 (published) | 15.48 |
| iTIS-PseTNC | 2.08 (published) | −173.08 |
| iTIS-PseTNC | 42.32 (tested on DeepGSR data) | 86.58 |
| TITER | 21.92 (published) | 74.09 |
| TITER | 18.95 (tested on DeepGSR data) | 70.03 |
| Average | – | 57.36 |
Note: Comparison between state-of-the-art methods and DeepGSR on the problem of TIS (ATG) recognition on the human genome.
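The TIS figures can be checked with a short script. It assumes the reduction is computed as (other − DeepGSR) / other × 100, which is inferred from the numbers rather than stated in the text. The table also does not say which rows enter the "Average"; the tabulated 57.36 matches the mean of the TIS-ANN (published) row and the two "tested on DeepGSR data" rows, so that selection is an inference as well. The function and variable names are illustrative.

```python
def relative_error_reduction(other_error: float, deepgsr_error: float) -> float:
    """Percentage by which DeepGSR lowers the other method's error rate."""
    return 100.0 * (other_error - deepgsr_error) / other_error


DEEPGSR_TIS_ERROR = 5.68  # DeepGSR error rate (%) from the table

# Error rates (%) of the compared methods, copied from the table.
tis_errors = {
    "TIS-ANN (published)": 6.72,
    "iTIS-PseTNC (published)": 2.08,
    "iTIS-PseTNC (tested on DeepGSR data)": 42.32,
    "TITER (published)": 21.92,
    "TITER (tested on DeepGSR data)": 18.95,
}

reductions = {
    name: relative_error_reduction(err, DEEPGSR_TIS_ERROR)
    for name, err in tis_errors.items()
}

# Inferred selection of rows that reproduces the tabulated average.
averaged_rows = [
    "TIS-ANN (published)",
    "iTIS-PseTNC (tested on DeepGSR data)",
    "TITER (tested on DeepGSR data)",
]
average = sum(reductions[name] for name in averaged_rows) / len(averaged_rows)

for name, value in reductions.items():
    print(f"{name}: {value:.2f}")
print(f"Average: {average:.2f}")
```

A negative value, as for the published iTIS-PseTNC error rate, means the compared method reports a lower error rate than DeepGSR on its own data.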