Tirthankar Paul, Seppo Vainio, Juha Roning.
Abstract
In this study, chaos game representation (CGR) is introduced for investigating patterns in genome sequences. It is an image representation of the genome that gives an overall visualization of the sequence. CGR is a mapping technique that assigns each base of the sequence to a position in the two-dimensional plane to portray the DNA sequence. Importantly, CGR provides a one-to-one mapping for individual nucleotides as well as for the sequence. A coordinate in the CGR plane identifies the corresponding base and its location in the original genome. Therefore, the whole nucleotide sequence (up to the current nucleotide) can be restored from a single point of the CGR. In this study, CGR coupled with an artificial neural network (ANN) is introduced as a new way to represent the genome and to classify intra-coronavirus sequences. A hierarchical clustering study was carried out to validate the approach, which was found to be more than 90% accurate when its results were compared with the phylogenetic tree of the corresponding genomes. Interestingly, the method makes the genome representation significantly shorter (more than 99% compression), saving storage space while preserving the genome features.
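The one-to-one property described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used corner assignment (A, C, G, T at the four corners of the unit square) and a starting point at the centre; the abstract does not state which convention the authors use, and the names `CORNERS`, `cgr_points` and `cgr_decode` are hypothetical. Exact rational arithmetic is used because floating-point coordinates lose the information needed for decoding after a few dozen bases.

```python
from fractions import Fraction

# Assumed corner assignment (a common convention, not confirmed by the abstract).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
START = (Fraction(1, 2), Fraction(1, 2))  # centre of the unit square


def cgr_points(seq):
    """Return the CGR point reached after each base of seq."""
    x, y = START
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_decode(point):
    """Recover the sequence that leads from START to the given CGR point."""
    corner_to_base = {corner: base for base, corner in CORNERS.items()}
    half = Fraction(1, 2)
    x, y = point
    bases = []
    while (x, y) != START:
        # The quadrant containing the point identifies the most recent base.
        cx = 0 if x < half else 1
        cy = 0 if y < half else 1
        bases.append(corner_to_base[(cx, cy)])
        # Undo the midpoint step to obtain the previous point.
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(bases))


points = cgr_points("ACTTGAATG")
restored = cgr_decode(points[-1])
```

Here `restored` equals the input string, which shows that the last point alone carries the whole sequence. For a genome-length input the exact fractions grow large, so this sketch demonstrates the property rather than offering a practical storage scheme.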
Keywords: Artificial neural network; Chaos game representation; Coronavirus
Year: 2022 PMID: 35095217 PMCID: PMC8779865 DOI: 10.1016/j.eswa.2022.116559
Source DB: PubMed Journal: Expert Syst Appl ISSN: 0957-4174 Impact factor: 6.954
Fig. 1. CGR for the short sequence ‘ACTTGAATG’.
Fig. 2. Frequency Chaos Game Representation.
Fig. 3. The 2-D representation of Coronavirus.
Fig. 4. Neural Network model for the virus classification.
Fig. 5a. Confusion Matrix for the Coronavirus classification of data.
Fig. 5b. Confusion Matrix for the Coronavirus classification of data.
Fig. 6a. Dendrogram tree from the data.
Fig. 6b. Dendrogram tree from the data.
Fig. 7. Phylogenetic tree for randomly selected coronavirus genomes.
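Fig. 2 names the frequency chaos game representation (FCGR). As generally defined, an FCGR divides the CGR square into a 2^k by 2^k grid and counts how many points fall in each cell, which is equivalent to counting k-mers. The sketch below is an illustration under assumptions, not the authors' implementation: the corner assignment, the value of `k`, and the function name `fcgr` are all hypothetical choices, since this record does not give them.

```python
# Assumed corner assignment (a common convention, not confirmed by this record).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Return a 2^k x 2^k grid of k-mer counts laid out as in a CGR image."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous symbols such as N
        col = row = 0
        # The last base selects the coarsest quadrant, so it supplies the
        # most significant bit of each cell index.
        for base in reversed(kmer):
            cx, cy = CORNERS[base]
            col = col * 2 + cx
            row = row * 2 + cy
        grid[row][col] += 1
    return grid


single_base_counts = fcgr("ACTTGAATG", k=1)
trimer_counts = fcgr("ACTTGAATG", k=3)
```

The grid has 4^k cells regardless of the genome length, so a long sequence is reduced to a fixed-size matrix that can be fed to a classifier. Whether this is how the reported compression figure was obtained is not stated in this record.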