Noopur Singh, Ravindra Nath, Dev Bukhsh Singh.
Abstract
Machine learning methods have played a major role in improving the accuracy of prediction and classification of DNA (deoxyribonucleic acid) and protein sequences. In eukaryotes, however, splice-site identification and prediction is not straightforward because of the large number of false positives. To address this problem, we present a deep learning model based on a bidirectional Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) that identifies and predicts splice sites for the prediction of exons from eukaryotic DNA sequences. During splicing of the primary mRNA transcript, the introns (the non-coding regions of the gene) are spliced out and the exons (the coding regions) are joined. The bidirectional LSTM-RNN model uses intron features: an intron starts with the splice-site donor (GT) and ends with the splice-site acceptor (AG), subject to length constraints. The model was improved by increasing the number of training epochs and achieved a maximum accuracy of 95.5%. The model is compatible with large sequential data such as a complete genome.
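The GT...AG intron rule described in the abstract can be illustrated with a minimal Python sketch. It only enumerates candidate spans that start with GT and end with AG within a length window; it is not the authors' pipeline, and the function name `candidate_introns` and the bounds `min_len` and `max_len` are assumptions chosen for illustration.

```python
def candidate_introns(seq, min_len=4, max_len=100):
    """Return (start, end) spans that begin with GT and end with AG.

    start is inclusive and end is exclusive. min_len and max_len are
    illustrative length constraints, not values taken from the paper.
    """
    seq = seq.upper()
    # Positions of every donor (GT) and acceptor (AG) dinucleotide.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "AG"]
    spans = []
    for i in donors:
        for j in acceptors:
            length = j + 2 - i  # span length including both dinucleotides
            if min_len <= length <= max_len:
                spans.append((i, j + 2))
    return spans
```

Most spans produced this way are false positives, which is the problem the abstract says the learned model is meant to address.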
Keywords: ANN, Artificial Neural Network; Bidirectional LSTM-RNN; CDS, Coding Sequence; DNA, Deoxyribonucleic Acid; Deep learning; Exon; Intron; LSTM-RNN, Long Short-Term Memory Recurrent Neural Network; Machine learning; RNA, Ribonucleic Acid; Splice-site
Year: 2022 PMID: 35663929 PMCID: PMC9157471 DOI: 10.1016/j.bbrep.2022.101285
Source DB: PubMed Journal: Biochem Biophys Rep ISSN: 2405-5808
Fig. 1 Architecture of the bidirectional LSTM. The directed arrows in the hidden layer show the forward and backward flow of information; w0, w1, w2 and wn represent the inputs, and y0, y1, y2 and yn represent the outputs.
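The two-directional flow described in the Fig. 1 caption can be sketched in a few lines of Python. To stay short, this sketch swaps the LSTM cell for a plain scalar tanh recurrent cell, so it shows only the forward and backward passes and how their states are paired per position. The weights `w_in` and `w_rec` and both function names are hypothetical.

```python
import math


def run_direction(inputs, w_in, w_rec):
    """Run a plain tanh recurrent cell over inputs and return every hidden state."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(inputs, w_in=0.5, w_rec=0.8):
    """Pair the forward and backward hidden state at each input position."""
    forward = run_direction(inputs, w_in, w_rec)
    # Process the reversed sequence, then flip the states back so that
    # index t of both lists refers to the same input position.
    backward = run_direction(inputs[::-1], w_in, w_rec)[::-1]
    return list(zip(forward, backward))
```

Each output position therefore sees context from both earlier and later inputs, which is the property the figure illustrates.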
Fig. 2 Proposed work plan using the bidirectional LSTM-RNN model for splice-site identification and prediction. The inputs are DNA sequences, which may be a complete genome. The sequences are first processed for ORF prediction, and the resulting ORFs are converted into a categorical numeric format. The prepared datasets are then passed through a bidirectional LSTM-RNN model consisting of an embedding layer, a dropout layer, a bidirectional LSTM layer and a dense layer. The outputs are the identified and predicted regions: donor sites, acceptor sites and no sites.
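The conversion to a "categorical numeric format" mentioned in the Fig. 2 caption might look like the following Python sketch. The integer codes in `BASE_TO_INT`, the use of 0 for padding and unknown bases, and the class order in `CLASSES` are all assumptions; the window length of 60 and the three output classes are chosen only to match the array shapes reported for the training and testing data.

```python
# Assumed integer codes; 0 is reserved here for padding and unknown bases.
BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}
# Assumed class order for the three outputs named in the caption.
CLASSES = ["donor", "acceptor", "no_site"]


def encode_sequence(seq, length=60):
    """Map a DNA string to a fixed-length list of integer codes."""
    codes = [BASE_TO_INT.get(base, 0) for base in seq.upper()[:length]]
    return codes + [0] * (length - len(codes))  # right-pad short windows


def one_hot_label(name):
    """Return a one-hot vector over CLASSES for the given class name."""
    return [1 if c == name else 0 for c in CLASSES]
```

Integer codes of this kind are what an embedding layer typically consumes, while the one-hot labels match a three-unit dense output.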
Result of ORFfinder for the reference genome of Cryptosporidium parvum Iowa II, median total length 9.1089 Mb and median GC% 30.2 (https://www.ncbi.nlm.nih.gov/genome/?term=cryptosporidium+parvum).
| Chromosome no. | RefSeq (NCBI) | Size (Mb) | No. of ORFs |
|---|---|---|---|
| | | 0.88 | 87 |
| | | 0.99 | 95 |
| | | 1.1 | 93 |
| | | 1.1 | 89 |
| | | 1.08 | 100 |
| | | 1.33 | 110 |
| | | 1.28 | 117 |
| | | 1.34 | 94 |
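As a rough illustration of what an ORF scan does, the following Python sketch reports ATG-to-stop spans in the three forward reading frames. It is a simplified stand-in and not NCBI ORFfinder itself: it ignores the reverse strand and alternative start codons, and the `min_codons` threshold is an assumed parameter, so it would not reproduce the counts in the table.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) spans of ATG...stop ORFs in the three forward frames.

    start is inclusive, end is exclusive, and the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return orfs
```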
Summarized bidirectional LSTM-RNN Model.
Fig. 3 Bidirectional LSTM-RNN model, showing each layer.
Training data, comprising 80% of the total prepared dataset (80% of 111015).
| Train_X | (88812, 60) |
|---|---|
| | (88812, 3) |
Fig. 4 Loss curve of the model. Training and test loss are high at the beginning, then gradually decrease and flatten, which indicates a well-fitted model. The training and test curves also stay very close to each other, a further sign of a good fit.
Fig. 5 Accuracy curve of the model. The training and test curves stay very close to each other, which indicates a well-fitted model with high accuracy.
Testing data, comprising 20% of the total prepared dataset (20% of 111015).
| Test_X | (22203, 60) |
|---|---|
| | (22203, 3) |
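The row counts in the training and testing shape tables follow directly from an 80/20 split of the 111015 prepared samples. The short Python check below uses integer arithmetic; the helper name `split_counts` is hypothetical, and how the authors actually performed the split (for example, whether they shuffled first) is not stated here.

```python
def split_counts(total, train_percent=80):
    """Return (train, test) sample counts for a percentage split."""
    train = total * train_percent // 100  # integer arithmetic avoids float rounding
    return train, total - train


train_n, test_n = split_counts(111015)
```

With these inputs, `train_n` and `test_n` match the first dimension of the reported training and testing arrays.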
Fig. 6 Test results of the bidirectional LSTM-RNN model: a test accuracy of 95.5% and a loss of 15.7%.
Result of predicted exons and introns by the model with genome annotation.
| Predicted | | Genome Annotation [ | |
|---|---|---|---|
| Exons | Introns | Exons | Introns |
| 4325 | 650 | 4553 | 688 |
| 1453 | 95 | 1514 | 99 |
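One simple way to read the table above is to compare each predicted count with its annotated count. The Python sketch below computes that ratio for both rows. It is only a count ratio, not an accuracy or recall figure, because the table does not say how many predicted features overlap annotated ones; the names `count_ratio`, `rows` and `ratios` are illustrative.

```python
def count_ratio(predicted, annotated):
    """Ratio of predicted to annotated feature counts."""
    return predicted / annotated


# (predicted, annotated) pairs copied from the two rows of the table.
rows = [
    {"exons": (4325, 4553), "introns": (650, 688)},
    {"exons": (1453, 1514), "introns": (95, 99)},
]

ratios = [
    {feature: round(count_ratio(*pair), 3) for feature, pair in row.items()}
    for row in rows
]
```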
Accuracy of proposed model and others.
| Deep Belief Networks [ | Unidirectional LSTM [ | LSTM-RNN [ | Bidirectional LSTM-RNN (proposed approach) |
|---|---|---|---|
| 0.888 | 0.820 | 0.943 | |