Garima Mathur1, Anjana Pandey2, Sachin Goyal2.
Abstract
The current pandemic has shown how quickly the coronavirus, which can jump from one human to another, is able to spread. Many other viruses, for example Ebola and SARS, are equally deadly and can spread as fast as the coronavirus because of the mobility and globalization of the population. Earlier identification of these viruses can help prevent outbreaks like the one we are currently facing and can support the earlier design of drugs. A disease can be identified at an early stage through DNA sequence classification, because DNA carries most of the genetic information about an organism. This is why the classification of DNA sequences plays an important role in computational biology. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn yields the patterns of various diseases; these patterns are then compared with samples from a newly infected person and can help in the earlier identification of disease. However, feature extraction remains a major challenge. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix are proposed. In the hot vector representation of a DNA sequence, each pair of the word is represented using a binary matrix that encodes the position of each nucleotide in the DNA sequence. The resulting matrix is then given as input to a traditional CNN for feature extraction.
The results of the proposed method have been compared with 5 well-known classifiers, namely convolutional neural network (CNN), support vector machine (SVM), K-nearest neighbor (KNN), decision tree, and recurrent neural network (RNN), on several parameters including precision and accuracy. The results show that the proposed method achieves an accuracy of 93.9%, which is the highest among the compared classifiers.
Keywords: Classifier; Convolution neural network (CNN); DNA sequence; Decision trees; Feature extraction; K-nearest neighbor (KNN) algorithm; Recurrent neural networks (RNN); Support vector machines (SVM)
Year: 2022 PMID: 35789598 PMCID: PMC9243743 DOI: 10.1007/s12652-022-04099-y
Source DB: PubMed Journal: J Ambient Intell Humaniz Comput
Fig. 1 DNA base pair structure with the sugar-phosphate backbone
Fig. 2 Number of samples in a dataset
DNA base pairs
| Symbol | Base |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T | Thymine |
| N | Can be any of these (A, C, G, T) |
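
The abstract describes a hot vector representation in which a binary matrix encodes the nucleotide found at each position of a sequence. The following is a minimal sketch of that general idea (plain one-hot encoding), not the authors' implementation. The column order `ACGT`, the function name `one_hot_encode`, and the choice to map the ambiguous symbol `N` to an all-zero row are assumptions made for illustration.

```python
# Column order of the binary matrix (an assumption of this sketch).
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element binary row per nucleotide in the sequence."""
    matrix = []
    for symbol in sequence.upper():
        if symbol == "N":
            # Assumption: an ambiguous base is encoded as an all-zero row.
            matrix.append([0, 0, 0, 0])
        elif symbol in BASES:
            # Exactly one column is set: the one matching this nucleotide.
            matrix.append([1 if symbol == base else 0 for base in BASES])
        else:
            raise ValueError(f"unexpected symbol: {symbol!r}")
    return matrix


encoded = one_hot_encode("ACGTN")
for symbol, row in zip("ACGTN", encoded):
    print(symbol, row)
```

A matrix of this shape (sequence length by 4) is the kind of input a convolutional layer can scan for local patterns.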
Fig. 3 Pattern recognition model
Fig. 4 a Training. b Prediction
Fig. 5 a Hyperplane in 2D. b Classification of vector and tags. c Best hyperplane
Fig. 6 a Structure of an artificial neuron. b Feature extraction by CNN
Fig. 7 Decision Tree
Prior research in the field of DNA sequence classification
| S. no. | Paper | Description |
|---|---|---|
| 1 | DNA sequence classification via an expectation–maximization algorithm and neural networks (Wang et al.) | This paper introduces a new technique to recognize E. coli promoters in DNA and to determine whether a given DNA sequence is an E. coli promoter. It uses the EM algorithm to locate binding sites in the E. coli promoter sequence, which improves on previous algorithms because it reduces the probability distribution of lengths. After the binding sites are located, features are selected in each sequence according to the available information content and represented using the orthogonal encoding method. Finally, the features are fed into a neural network for promoter recognition |
| 2 | Vector space classification of DNA sequences (Müller et al.) | This paper highlights the issues associated with the identification of introns and exons. PCA is used for DNA sequence classification. Sequences are converted into document vectors that represent their word content, and this word content is then used to classify the sequences. The approach was tested on many DNA data sets for intron–exon classification and gave the highest accuracy compared to other approaches |
| 3 | DNA sequence classification using DAWGs (Levy et al.) | DNA sequence classification is described as assigning the words or substrings of a given sequence to a unique sequence class. To classify an unknown sequence, all of its words are looked up in a dictionary so that they can be assigned to one of DNA's three basic classes. In this work, efficiency is increased by constructing DAWGs, i.e. directed acyclic word graphs. With this method, 94% of a test set was identified, with only 4% failures |
| 4 | Classification of gene expression data using fuzzy logic (Ohno-Machado et al.) | Technologies such as microarrays have made it possible to measure the expression levels of many genes at the same time. These levels can then be used to classify tissues into a prognostic or diagnostic category. Because measurements from different microarray technologies are made on different scales, a simple, easy-to-understand classification scheme that is technology-independent can be very useful. This paper uses 2 examples to show how fuzzy logic can capture the issues involved in the classification of gene expression data. In terms of classification performance, fuzzy inference is on par with other classifiers, but it is simple and easy to understand |
| 5 | Complementary classification approaches for protein sequences (Wang et al.) | This paper studies 5 methods of protein classification and applies them to related proteins that belong to the PROSITE catalog. Of these, 4 methods are based on a database search (block-based approach) and the remaining one is based on motifs that appear repeatedly in a protein family. The authors conclude that the block-based approach is the most suitable method for amino acids that occur in blocks, and that using all 5 methods together can give a good classification result |
| 6 | Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA (Yang et al.) | This paper focuses on the basic definition of DNA and the use of machine learning techniques for mining DNA sequences. It also analyzes the basic concepts of data mining, various ML algorithms, and the problems they face during sequence mining. It reviews the exponential growth of DNA sequencing technologies, the structure of sequences, and their similarity |
| 7 | DNA sequence classification by convolutional neural network (Nguyen et al.) | This work uses a convolutional neural network, whose convolution layers extract features from the given input data. The neurons of each convolution layer use the features extracted by the previous layer to extract higher-level abstract features. In this method, DNA data is treated as text and a CNN is applied to it in the same way as it is applied to text. The authors used twelve datasets of DNA sequences and reported that the CNN provides the best solution to the sequence classification problem |
| 8 | Deep learning architectures for DNA sequence classification (Bosco et al.) | Generic computation is used for medical-related data analysis. DNA classification is considered one of its important tasks, and machine learning (ML) techniques have been successful at it. However, feature selection remains a problem: machine learning methods depend heavily on feature selection, and selecting meaningful features is itself a difficult task. Deep learning models have already proven themselves at extracting useful features from input patterns. This work compares 5 different classification methods on public DNA sequence data and also introduces 2 deep learning models |
| 9 | A neural network-based multi-classifier system for gene identification in DNA sequences (Ranawana et al.) | This article proposes a neural network-based multi-classifier system for identifying E. coli promoter sequences. Because every gene is preceded by a promoter sequence, locating the E. coli promoter is important for the successful identification of genes in a DNA sequence. The multi-classifier system was tested on different promoter and non-promoter sequences, and the results show that it gives better predictions than other systems developed so far |
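
Several of the surveyed approaches treat a DNA sequence as text made of short "words" and classify it by its word content. A minimal sketch of that idea is shown here, assuming overlapping words of length 3; the word length, the function names `kmer_words` and `word_count_vector`, and the example sequence are all hypothetical and are not taken from any of the cited papers.

```python
from collections import Counter


def kmer_words(sequence, k=3):
    """Split a sequence into overlapping words of length k."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def word_count_vector(sequence, vocabulary, k=3):
    """Count how often each vocabulary word occurs in the sequence."""
    counts = Counter(kmer_words(sequence, k))
    return [counts[word] for word in vocabulary]


words = kmer_words("ATGCATG")
vocabulary = sorted(set(words))
vector = word_count_vector("ATGCATG", vocabulary)
print(words)
print(vocabulary)
print(vector)
```

The resulting count vector plays the role of a document vector: two sequences can then be compared, or passed to a classifier, through their word counts over a shared vocabulary.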
Fig. 8 Hot vector representation of words
Fig. 9 Dictionary of words
Fig. 10 Hot vector representation of FASTA DNA sequence
Fig. 11 Proposed framework
Fig. 12 The overall architecture used in result visualization
The evaluation result of CNN
| Classifier | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Average |
|---|---|---|---|---|---|---|---|
| CNN | Sn (%) | 60.0 | 60.0 | 45.0 | 45.0 | 50.0 | 52.0 |
| | Sp (%) | 95.0 | 95.0 | 90.0 | 95.0 | 100.0 | 95.0 |
| | Pre (%) | 92.31 | 92.31 | 81.82 | 90.0 | 100.0 | 91.288 |
| | Acc (%) | 77.5 | 77.5 | 67.5 | 70.0 | 75.0 | 73.5 |
| | MCC | 0.5871 | 0.5871 | 0.3919 | 0.4619 | 0.5774 | 0.5211 |
| | F1 | 0.7273 | 0.7273 | 0.5806 | 0.6 | 0.6667 | 0.6604 |
| | AUROC | 0.7575 | 0.8525 | 0.8375 | 0.765 | 0.795 | 0.8015 |
| | AUPRC | 0.827 | 0.8702 | 0.836 | 0.7893 | 0.8574 | 0.836 |
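
The threshold-based metrics in these tables (Sn, Sp, Pre, Acc, MCC, F1) can all be derived from the four counts of a binary confusion matrix. The sketch below uses the standard definitions. The counts TP = 12, FP = 1, TN = 19, FN = 8 are an assumption: they are one set of counts (20 positive and 20 negative test samples) that reproduces the Fold 0 values reported for the CNN, and the source does not state them. The helper names are hypothetical.

```python
import math


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from raw counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Sn": safe_div(100 * tp, tp + fn),   # sensitivity (recall), percent
        "Sp": safe_div(100 * tn, tn + fp),   # specificity, percent
        "Pre": safe_div(100 * tp, tp + fp),  # precision, percent
        "Acc": safe_div(100 * (tp + tn), tp + fp + tn + fn),  # accuracy, percent
        "MCC": safe_div(tp * tn - fp * fn, mcc_denominator),
        "F1": safe_div(2 * tp, 2 * tp + fp + fn),
    }


# Assumed counts that are consistent with the CNN Fold 0 column.
fold0 = binary_metrics(tp=12, fp=1, tn=19, fn=8)
for name, value in fold0.items():
    print(name, round(value, 4))
```

Returning 0.0 for a zero denominator is a convention chosen for this sketch; under it, a classifier that never predicts the positive class gets precision, F1, and MCC of 0.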
The evaluation result of the Decision Tree
| Classifier | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Average |
|---|---|---|---|---|---|---|---|
| Decision tree | Sn (%) | 50.0 | 70.0 | 55.0 | 70.0 | 65.0 | 62.0 |
| | Sp (%) | 55.0 | 65.0 | 70.0 | 60.0 | 65.0 | 63.0 |
| | Pre (%) | 52.63 | 66.67 | 64.71 | 63.64 | 65.0 | 62.53 |
| | Acc (%) | 52.5 | 67.5 | 62.5 | 65.0 | 65.0 | 62.5 |
| | MCC | 0.0501 | 0.3504 | 0.2529 | 0.3015 | 0.3 | 0.251 |
| | F1 | 0.5128 | 0.6829 | 0.5946 | 0.6667 | 0.65 | 0.6214 |
| | AUROC | 0.525 | 0.675 | 0.625 | 0.65 | 0.65 | 0.625 |
| | AUPRC | 0.6382 | 0.7582 | 0.711 | 0.7432 | 0.7375 | 0.7176 |
The evaluation result of MLP
| Classifier | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Average |
|---|---|---|---|---|---|---|---|
| MLP | Sn (%) | 70.0 | 80.0 | 80.0 | 85.0 | 80.0 | 79.0 |
| | Sp (%) | 90.0 | 70.0 | 95.0 | 55.0 | 75.0 | 77.0 |
| | Pre (%) | 87.5 | 72.73 | 94.12 | 65.38 | 76.19 | 79.184 |
| | Acc (%) | 80.0 | 75.0 | 87.5 | 70.0 | 77.5 | 78.0 |
| | MCC | 0.6124 | 0.5025 | 0.7586 | 0.4193 | 0.5507 | 0.5687 |
| | F1 | 0.7778 | 0.7619 | 0.8649 | 0.7391 | 0.7805 | 0.7848 |
| | AUROC | 0.8075 | 0.845 | 0.8975 | 0.755 | 0.8225 | 0.8255 |
| | AUPRC | 0.8183 | 0.8608 | 0.9232 | 0.7375 | 0.8339 | 0.8347 |
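
The Average column in these tables is the arithmetic mean of the five per-fold values. As a small worked check, the sketch below recomputes it for the MLP accuracy row. The sample standard deviation is an addition of this sketch, included only to show how fold-to-fold spread could be summarized; the source does not report it.

```python
from statistics import mean, stdev

# Per-fold accuracy (%) of the MLP, copied from the table above.
mlp_accuracy = [80.0, 75.0, 87.5, 70.0, 77.5]

average = mean(mlp_accuracy)   # matches the Average column
spread = stdev(mlp_accuracy)   # sample standard deviation across folds

print(f"average accuracy: {average:.1f}%")
print(f"standard deviation across folds: {spread:.2f} percentage points")
```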
The evaluation result of the RNN
| Classifier | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Average |
|---|---|---|---|---|---|---|---|
| RNN | Sn (%) | 45.0 | 60.0 | 55.0 | 30.0 | 25.0 | 43.0 |
| | Sp (%) | 95.0 | 95.0 | 85.0 | 100.0 | 100.0 | 95.0 |
| | Pre (%) | 90.0 | 92.31 | 78.57 | 100.0 | 100.0 | 92.176 |
| | Acc (%) | 70.0 | 77.5 | 70.0 | 65.0 | 62.5 | 69.0 |
| | MCC | 0.4619 | 0.5871 | 0.4193 | 0.4201 | 0.378 | 0.4533 |
| | F1 | 0.6 | 0.7273 | 0.6471 | 0.4615 | 0.4 | 0.5672 |
| | AUROC | 0.815 | 0.885 | 0.8975 | 0.785 | 0.7575 | 0.828 |
| | AUPRC | 0.8662 | 0.8647 | 0.8864 | 0.8125 | 0.7998 | 0.8459 |
The evaluation result of the SVM
| Classifier | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Average |
|---|---|---|---|---|---|---|---|
| SVM | Sn (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | Sp (%) | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| | Pre (%) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | Acc (%) | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| | MCC | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | F1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | AUROC | 0.1225 | 0.115 | 0.1 | 0.2475 | 0.195 | 0.156 |
| | AUPRC | 0.324 | 0.3222 | 0.3203 | 0.3607 | 0.3427 | 0.334 |
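
AUROC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so a value below 0.5 means the scores tend to rank negatives above positives. The sketch below computes AUROC by direct pairwise comparison and shows that negating the scores turns a value a into 1 - a. The labels and scores are hypothetical toy data, not values from the paper, and the sketch makes no claim about why the SVM scores in the table behave as they do.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data in which most negatives outscore the positives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.1, 0.6, 0.7, 0.8, 0.5]
flipped = [-s for s in scores]

print(auroc(labels, scores))
print(auroc(labels, flipped))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for small test folds; a rank-based formulation would be preferable for large data sets.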
The evaluation result of the Proposed Method
| Classifier | Metric | Fold 0 | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Average |
|---|---|---|---|---|---|---|---|
| Proposed method | Sn (%) | 60.0 | 60.0 | 45.0 | 45.0 | 50.0 | 52.0 |
| | Sp (%) | 95.0 | 95.0 | 90.0 | 95.0 | 100.0 | 95.0 |
| | Pre (%) | 92.31 | 92.31 | 81.82 | 100.0 | 100.0 | 93.288 |
| | Acc (%) | 93.5 | 92.5 | 97.5 | 91.0 | 95.0 | 93.9 |
| | MCC | 0.5871 | 0.5871 | 0.3919 | 0.4619 | 0.5774 | 0.5211 |
| | F1 | 0.7273 | 0.7273 | 0.5806 | 0.6 | 0.6667 | 0.6604 |
| | AUROC | 0.7575 | 0.8525 | 0.8375 | 0.765 | 0.795 | 0.8015 |
| | AUPRC | 0.827 | 0.8702 | 0.836 | 0.7893 | 0.8574 | 0.836 |
Fig. 13 a CNN clustering. b Decision tree clustering. c RNN clustering. d SVM clustering. e MLP clustering. f Proposed method
Comparison with the previous best results
| Parameter | Previous best result (%) | Average, proposed method (%) | Increase (percentage points) |
|---|---|---|---|
| Precision | 92.176 | 93.288 | 1.112 |
| Accuracy | 78.0 | 93.9 | 15.9 |
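
The comparison above can be reproduced from the Average columns of the per-classifier tables: the previous best precision comes from the RNN and the previous best accuracy from the MLP. The sketch below redoes that arithmetic. The dictionary layout and the helper name `best_previous` are illustrative choices, while the numbers are copied from the tables.

```python
# Average precision and accuracy (%) per baseline, from the tables above.
baselines = {
    "CNN": {"Precision": 91.288, "Accuracy": 73.5},
    "Decision tree": {"Precision": 62.53, "Accuracy": 62.5},
    "MLP": {"Precision": 79.184, "Accuracy": 78.0},
    "RNN": {"Precision": 92.176, "Accuracy": 69.0},
    "SVM": {"Precision": 0.0, "Accuracy": 50.0},
}
proposed = {"Precision": 93.288, "Accuracy": 93.9}


def best_previous(metric):
    """Return (classifier name, value) of the best baseline for a metric."""
    name = max(baselines, key=lambda clf: baselines[clf][metric])
    return name, baselines[name][metric]


for metric in ("Precision", "Accuracy"):
    name, value = best_previous(metric)
    increase = round(proposed[metric] - value, 3)
    print(f"{metric}: best baseline {name} = {value}, increase = {increase}")
```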
Fig. 14 Average accuracy comparison of classification techniques