
Comparative genomic signature representations of the emerging COVID-19 coronavirus and other coronaviruses: High identity and possible recombination between Bat and Pangolin coronaviruses.

Rabeb Touati1, Sondes Haddad-Boubaker2, Imen Ferchichi3, Imen Messaoudi4, Afef Elloumi Ouesleti5, Henda Triki2, Zied Lachiri6, Maher Kharrat3.   

Abstract

Coronaviruses are responsible for respiratory diseases in animals and humans. The combination of numerical encoding techniques and digital signal processing methods is becoming increasingly important in handling large genomic data. In this paper, we propose to analyze the SARS-CoV-2 genomic signature using a combination of different nucleotide representations and signal processing tools, with the aim of identifying its genetic origin. The sequence of SARS-CoV-2 was compared with 21 relevant sequences, including bat, yak and pangolin coronavirus sequences. In addition, we developed a new algorithm to locate nucleotide modifications. The results show that the Bat and Pangolin coronaviruses were the most closely related to SARS-CoV-2, with 96% and 86% identity, respectively, along the whole genome. Within the S gene sequence, the Pangolin sequence locally presents the highest nucleotide identity. These findings suggest the genesis of SARS-CoV-2 through evolution from bat and pangolin strains. This study offers new ways to automatically characterize viruses.
Copyright © 2019. Published by Elsevier Inc.


Keywords:  Bat; COVID19; Genome signature; Pangolin; SARS-CoV-2; Yak

Year:  2020        PMID: 32645523      PMCID: PMC7336935          DOI: 10.1016/j.ygeno.2020.07.003

Source DB:  PubMed          Journal:  Genomics        ISSN: 0888-7543            Impact factor:   5.736


Introduction

In December 2019, an unidentified human pneumonia emerged from a local fresh seafood market in Wuhan. The emerging virus was later identified as a coronavirus, named SARS-CoV-2, responsible for Coronavirus Disease 19 (COVID-19) [1,2]. It has spread to more than 216 countries around the world, causing a pandemic [3]. It is considered the third major zoonotic human coronavirus outbreak of this century, after the two earlier coronavirus outbreaks: Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) and Middle East Respiratory Syndrome Coronavirus (MERS-CoV) [4,5]. Although COVID-19 had a death rate of 2.8% as of April, the 8,844,171 confirmed cases and 465,460 confirmed deaths in a few months (December 8, 2019 to June 22, 2020) are terrifying. Indeed, this virus is highly contagious, and the number of infected people can double in less than seven days, with a basic reproductive number (R0) of 2.2–2.7 [6]. SARS-CoV-2 is an enveloped, positive-sense single-stranded RNA virus of almost 29,700 nucleotides. It belongs to the family Coronaviridae and the subfamily Orthocoronavirinae, which contains 4 genera: Alphacoronavirus, Betacoronavirus, Gammacoronavirus and the recently defined Deltacoronavirus. SARS-CoV-2 is a member of the genus Betacoronavirus and of the SARSr-CoV species, and is responsible for severe lower respiratory tract infection in humans, as are SARS and MERS infections [7]. Phylogenetic studies have shown that coronaviruses have a complex evolutionary history originating from an ancient common ancestor (around 10,000 years ago) [8]. Indeed, these viruses present a high rate of mutation and recombination, which enables great plasticity and highly dynamic genome evolution. The mutation rate of coronaviruses varies from 1/1000 to 1/10,000 nucleotides during replication by the RNA-dependent RNA polymerase (RdRP) [9]. 
Also, coronaviruses are known to use a template-switching mechanism, resulting in a high rate of homologous RNA recombination among the genomes of different viral strains [4]. Furthermore, the existence of different virus hosts favors cross-species infection, resulting in the adaptation of viruses and the emergence of new ones [8,9]. For instance, bats can harbor different types of coronavirus, creating a favorable environment for the emergence of new viruses [10]. Scientists are still trying to understand how SARS-CoV-2 emerged and infected humans, and different hypotheses have been proposed. Recently, analysis of the whole genome of two viruses (HKU-SZ-002a and HKU-SZ-005b) confirmed that SARS-CoV-2 belongs to lineage B (Sarbecovirus) of Betacoronavirus and demonstrated the existence of a novel coronavirus genetically closer to the bat SARS-like coronaviruses bat-SL-CoVZXC21 (MG772934) and bat-SL-CoVZC45 (MG772933) [10,11]. Another study [12] showed, using phylogenetic and similarity plot analysis, that SARS-CoV-2 has the highest similarity (96.3%) with the Bat coronavirus RaTG13 along the whole genome. Others suggest that human SARS-CoV-2 could also have evolved from Yak coronaviruses [13] or through recombination with Pangolin coronaviruses [14,15]. The genomic signature of an organism is a very important graphical representation that allows an understanding of intragenic variations [16,17,18], especially when handling large genomic data. This approach is based on the Chaos Game Representation (CGR), derived from the chaos theory of Jeffrey et al. (1990), and it has been considered as a mapping method for large genome sequences [19]. In this study, we try to identify the genetic origin of SARS-CoV-2 using a combination of different DNA representations and signal processing tools. 
We compared the full genome sequence of SARS-CoV-2 to relevant viral genomes: sequences of the four Orthocoronavirinae genera, the 15th species included in the Betacoronavirus genus, and also Yak and Pangolin coronaviruses.

Material and methods

First, we extracted the genomic sequences from the NCBI database (http://www.ncbi.nlm.nih.gov/Genbank/). Second, we generated a CGR image for each sequence to capture the information of the whole genome sequence, and then computed the centroid points (Chaos-Centroid) of the M sub-squares of each CGR image. Third, we applied the Smoothed Discrete Fourier Transform (SDFT) to the order-two Frequency Chaos Game Signal (FCGS) of each genomic sequence, with the goal of assessing the correlation between the genomes.

Genomic database

The complete genomes were downloaded from the NCBI database. The sequences investigated in this study are presented in Table 1. Our genomic database contains 22 species in total (Table 1).
Table 1

Genomic sequence database and corresponding sequence lengths extracted from the NCBI platform.

Order | Species | Accession number | Length (nt)
A | SARS-CoV-2 | NC_045512 | 29,903
B | Betacoronavirus RaTG13 | MN996532 | 29,855
C | Betacoronavirus CoVZC45 | MG772933 | 29,802
D | Betacoronavirus CoVZXC21 | MG772934.1 | 29,732
E | Alphacoronavirus | DQ811787 | 27,533
F | Gammacoronavirus A116E7 | FN430415 | 27,593
G | Deltacoronavirus | KJ481931 | 25,406
H | Bat Hp-betacoronavirus/ | NC_025217 | 31,491
I | Bovine coronavirus Mebus | BCU00735 | 31,032
J | Human coronavirus OC43 | KX344031 | 30,713
K | Human coronavirus OC43 ATCC VR-759 | AY585228 | 30,741
L | Porcine hemagglutinating encephalomyelitis virus VW572 | DQ011855 | 30,480
M | Murine hepatitis virus JHM | AC_000192 | 31,526
N | Rat coronavirus Parker | FJ938068.1 | 31,250
O | Pipistrellus bat coronavirus HKU5/HK/03/2005 | NC_009020 | 30,482
P | Rousettus bat coronavirus HKU9/GD/005/2005 | NC_009021 | 29,114
Q | SARS-related Rhinolophus bat coronavirus Rf1/2004 | NC_004718.3 | 29,751
R | SARS-related palm civet coronavirus SZ3/2003 | AY304486.1 | 29,741
S | SARS-related chinese ferret badger coronavirus CFB/SZ/94/03 | AY545919 | 29,739
T | Tylonycteris bat coronavirus HKU4/HK/04/2005 | NC_009019 | 30,286
U | Yak coronavirus strain YAK/HY24/CH/2017 | MH810163 | 31,032
V | Pangolin coronavirus isolate MP789 | MT121216 | 29,521

CGR technique

The CGR technique was proposed by Jeffrey as a unique and scale-independent representation for DNA sequences [19]. Mapping a genome sequence using the frequency chaos game representation (FCGR) produces fractal landscapes. This iterative mapping technique assigns to each nucleotide in a DNA sequence, or each amino acid in a protein, a unique coordinate (x, y) in a 2-dimensional space. The resulting 2-D image stores the distribution of the dots as a 0–1 square matrix, where 0 represents an empty coordinate and 1 represents a dot. Thus, the element occupying the nth position of a DNA sequence (Seq = S1, S2, .., SL) composed of L nucleotides (A, T, C, or G) is represented in the square by a point CGRn. This point is placed halfway between the previously plotted point CGRn−1 and the vertex corresponding to the read letter Sn [19]. This value is chosen according to previous research [16,17,19]. The proposed CGR steps are shown in Algorithm 1. Fig. 1.a shows the CGR plot of the sequence ‘AGCTGC’. The obtained point set shows a fractal pattern. Applied to various species, the CGR algorithm produces specific and structured images that can differentiate each of the studied genomes from the others. The CGR-Centroid is obtained by dividing the CGR image into M sub-images and calculating the centroid point of each sub-image. The number M is equal to 4^m, where m is the oligonucleotide length considered in the study and must be greater than zero. The idea is to divide the CGR image into sub-images in order to obtain the genomic signature of each region. Essentially, each pixel in a CGR image is associated with a specific word in a given genomic sequence. Therefore, each visible pattern in the CGR corresponds to some specific pattern of the genomic sequence. The CGR representation as a whole thus shows the global information of the nucleotide sequence, while each partition of the image contains local information.
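The halfway rule described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the assignment of nucleotides to the corners of the unit square, the starting point at the center, and the function names `cgr_points` and `cgr_image` are assumptions made for illustration.

```python
import numpy as np

# Assumed corner assignment (a common convention; the text does not state it).
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq, start=(0.5, 0.5)):
    """Return an (L, 2) array of CGR coordinates for a DNA sequence."""
    x, y = start
    points = []
    for base in seq.upper():
        if base not in VERTICES:
            continue  # skip ambiguous symbols such as N
        vx, vy = VERTICES[base]
        # Move halfway from the previous point towards the vertex of this base.
        x, y = 0.5 * (x + vx), 0.5 * (y + vy)
        points.append((x, y))
    return np.array(points, dtype=float).reshape(-1, 2)

def cgr_image(seq, size=64):
    """Rasterise the CGR points into a size x size 0-1 matrix."""
    img = np.zeros((size, size), dtype=np.uint8)
    pts = cgr_points(seq)
    cols = np.minimum((pts[:, 0] * size).astype(int), size - 1)
    rows = np.minimum((pts[:, 1] * size).astype(int), size - 1)
    img[rows, cols] = 1  # 1 marks a dot, 0 an empty coordinate
    return img

points = cgr_points("AGCTGC")
```

With these assumptions, the first letter of ‘AGCTGC’ lands at (0.25, 0.25), halfway between the center and the A corner, and each later point depends on all preceding letters.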
Fig. 1

CGR process and CGR-Centroid examples.

For example, if we divide the image into 4 sub-images and calculate the center of each, we find the CGR centroid corresponding to each nucleotide (A, T, C or G). When two points are within the same quadrant, they correspond to successions of nucleotides that end with the same mononucleotide; when they are within the same sub-quadrant, the successions end with the same dinucleotide; and so on. Therefore, the coordinate of the centroid, which corresponds to the local information of the sub-region, can differentiate the sequences and can be used to find the relationship between nucleotide sequences. To obtain the CGR Centroid, a mandatory first step is to divide the CGR image into 4 squares. For the CGR Centroid, the CGR image is partitioned into (n/M) × (n/M) equal sub-images, where n is the size of the CGR image. Then, for each sub-region, the distances between the SARS-CoV-2 centroids and the centroids of the other sequences are computed. These distances indicate the relation between the DNA sequences. Algorithm 2 presents the steps of the CGR Centroid. Fig. 1.b shows an example of the CGR-centroid plot of each sub-region and the center of the CGR (X) for SARS-CoV-2 after partitioning into sub-regions of size 2 × 2. The distance values between each sub-region centroid (A-centroid, T-centroid, C-centroid, and G-centroid) and the CGR center (X) can indicate whether or not similarities exist between the nucleotide sequences.
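As a rough illustration of the centroid step, the sketch below computes one centroid per sub-square and the per-sub-square distance between two sequences. It is a sketch under assumptions rather than the authors' Algorithm 2: the corner assignment, the use of point coordinates instead of image pixels, and the names `cgr_centroids` and `centroid_distances` are all hypothetical.

```python
import numpy as np

# Assumed corner assignment (not stated in the text).
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """CGR coordinates using the halfway rule, starting from the center."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper():
        if base in VERTICES:
            vx, vy = VERTICES[base]
            x, y = 0.5 * (x + vx), 0.5 * (y + vy)
            pts.append((x, y))
    return np.array(pts, dtype=float).reshape(-1, 2)

def cgr_centroids(seq, m=1):
    """Centroid of the CGR points falling in each of the 4**m sub-squares."""
    pts = cgr_points(seq)
    k = 2 ** m  # sub-squares per side
    cells = np.minimum((pts * k).astype(int), k - 1)
    centroids = {}
    for cell in {tuple(c) for c in cells.tolist()}:
        mask = np.all(cells == np.array(cell), axis=1)
        centroids[cell] = pts[mask].mean(axis=0)
    return centroids

def centroid_distances(seq_a, seq_b, m=1):
    """Euclidean distance between matching sub-square centroids of two sequences."""
    ca, cb = cgr_centroids(seq_a, m), cgr_centroids(seq_b, m)
    return {cell: float(np.linalg.norm(ca[cell] - cb[cell]))
            for cell in ca.keys() & cb.keys()}
```

With m = 1 each of the four quadrants collects the points whose last nucleotide is the same, so the four centroids correspond to A, T, C and G; m = 2 gives the 16 dinucleotide centroids. Distances near zero suggest similar local composition.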

Time-frequency analysis technique

Numerical representation of genomes using coding methods is an important step for visualizing and characterizing the hidden information they may contain, especially in cases such as this one, where the nucleotide sequences do not show any continuity or homology between them. Different coding techniques exist: the binary [20], the structural bending trinucleotide (PNUC) [21], the electron-ion interaction pseudo-potential (EIIP) mapping [22], the FCGS [23,24,25], and so on. In addition, several signal processing techniques have been applied successfully to detect relationships between sequences, to detect biological repetitive sequences, and so on. In this paper we use the Frequency Chaos Game Signal (FCGS) as the coding technique (first step) and the Smoothed Discrete Fourier Transform (SDFT) and the Wavelet Transform as the analysis techniques (second step).

Genomic signal representation

The nucleotide sequences of our database, extracted from the NCBI platform, are converted into numerical sequences (1-D signals) before processing. The 1-D signals are generated by applying the FCGS of order 2 (FCGS2). This coding technique is based on the probability of occurrence of two successive nucleotides in an input sequence [23,24,25]. The probability P2_nuc of a given dinucleotide in the sequence is P2_nuc = N2_nuc / L, where N2_nuc represents the number of occurrences of the two successive nucleotides in the sequence and L represents the length of the sequence in base pairs. After that, at position (k), the oligomer (i), which consists of 2 nucleotides, is replaced by its corresponding occurrence probability, S[k] = P2_nuc(i).
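The FCGS2 coding described above can be sketched in a few lines of Python. The sketch assumes overlapping dinucleotides read with a step of one base and normalizes counts by the sequence length L, following the definitions in the text; the function name `fcgs2` is hypothetical.

```python
from collections import Counter

def fcgs2(seq):
    """Replace each dinucleotide of seq by its occurrence probability N2_nuc / L."""
    seq = seq.upper()
    # Assumption: dinucleotides overlap (positions k, k+1 for every k).
    dinucs = [seq[k:k + 2] for k in range(len(seq) - 1)]
    counts = Counter(dinucs)
    length = len(seq)
    return [counts[d] / length for d in dinucs]

signal = fcgs2("AAAT")  # dinucleotides: AA, AA, AT
```

For ‘AAAT’, AA occurs twice and AT once in a sequence of length 4, so the signal is 0.5, 0.5, 0.25. The resulting 1-D signal has L − 1 samples and can be passed to any spectral analysis routine.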

Smoothed discrete Fourier transform: SDFT

We chose the SDFT, a space-scale analysis, to be applied to the nucleotide signals. In its definition, Δn represents the overlap value and l the window index. The width of the window W must be chosen so that the number of samples of S[n] provides the best frequency resolution; the sequence size and the type of window both influence the values of the frequency parameters. N and R are taken as powers of two, as recommended for the Fast Fourier Transform (FFT) algorithm [20,21,22,23].

The procedure is as follows. First, the nucleotide signal S[n] is divided into R portions with an overlap ΔR. Second, each portion is divided into N frames by multiplication with a sliding window W[n]. Third, the Discrete Fourier Transform (DFT) is applied to each weighted block S[n, l], giving S[ω] in the spectral domain, where ω represents the frequency index, ω ∈ [0 : N − 1]. Fourth, the mean value corresponding to each N frame within the R segments is calculated, and then the mean DFT value over all R frames. This yields the mean smoothed spectrum (Eq. 5), where l represents the frame index of the N frames (l ∈ [0 : N − 1]) and i the frame index of the R frames (i ∈ [0 : R − 1]).

To obtain the most accurate smoothed spectrum, the Blackman window was chosen as the window type. In addition, we can follow the evolution of the instantaneous frequencies by considering the 2-D spectrogram representation. The final matrix Mat contains the time-frequency information corresponding to the studied sequence. This is an efficient representation for visualizing the evolution of periodicities along a nucleotide sequence.
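A minimal sketch of such a two-level averaged spectrum is given below. It is an approximation under stated assumptions, not the authors' exact SDFT: R and N are treated as segment and frame lengths, ΔR and Δn as hop sizes between successive segments and frames, magnitudes are averaged, and the function name `smoothed_spectrum` is hypothetical. The default values mirror those reported in the Results (1024, 512, 256, 64).

```python
import numpy as np

def smoothed_spectrum(signal, R=1024, dR=512, N=256, dN=64):
    """Average the magnitude DFT of Blackman-windowed frames over all segments."""
    signal = np.asarray(signal, dtype=float)
    window = np.blackman(N)
    segment_means = []
    for i in range(0, len(signal) - R + 1, dR):        # long segments
        segment = signal[i:i + R]
        frames = np.array([segment[j:j + N] * window   # windowed sub-frames
                           for j in range(0, R - N + 1, dN)])
        magnitudes = np.abs(np.fft.fft(frames, axis=1))
        segment_means.append(magnitudes.mean(axis=0))  # mean over frames
    if not segment_means:
        raise ValueError("signal is shorter than one segment of length R")
    return np.mean(segment_means, axis=0)              # mean over segments
```

Two spectra obtained this way have the same length N regardless of genome length, so they can be compared directly, for example with `np.corrcoef(spec_a, spec_b)[0, 1]`.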

Wavelet transform

A time-frequency nucleotide image with three color channels (Red, Green, Blue) is the best way to visualize the different patterns that can differentiate the sequences. If the luminance of the pixels changes between two nucleotide images, we can determine the modified patterns. To reach this goal, we chose the wavelet transform as the analysis technique, which turns the nucleotide signal into a nucleotide image. Our choice is based on the performance of the wavelet transform, which has been widely used on data from the biological domain [22,24,26,27]. In this paper, the continuous wavelet transform (CWT) with the complex Morlet wavelet has been used, as this wavelet is the best one in terms of localization in the time and frequency domains. The principle of the method consists of decomposing a nucleotide signal into a sum of basic functions called wavelets. These wavelets are derived from the mother wavelet by dilation and translation operations. Wavelet analysis applied to a given signal therefore takes into account both time and frequency variations. Unlike the mother wavelet, which only has a time variation parameter and is expressed by the function ψ(t), the daughter wavelet depends on time (a) and scale (b) parameters. In its expression, * indicates the complex conjugate and ω0 is the number of oscillations, which must be greater than 5 (admissibility condition) [28,29,30]. The CWT of a DNA signal x(t) is a matrix W(a, b) which contains the continuous wavelet coefficients. The DNA scalogram (2D) is a representation of the modulus |W(a, b)|. After obtaining the matrix W(a, b) of each sequence, for example SARS-CoV-2 and RaTG13, we apply a new algorithm that detects the nucleotide variations existing between these sequences. The new algorithm we have developed is based on computing the correlation value between two matrices corresponding to two different genomes with a variable window. Fig. 2 shows a flowchart of the methodology used to extract similar nucleotide sequences in two genomes, with the example of SARS-CoV-2 and betacoronavirus RaTG13. The aim here is to find the similar sequences and the modified nucleotides in two genomes by computing the correlation values and, if the correlation is not equal to 1, shifting one position to the next base pair until similar matrices are obtained.
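The idea of comparing scalograms window by window can be sketched as follows. This is only an illustrative approximation and not the authors' algorithm: it uses a fixed (not variable) window, assumes two equal-length, already aligned signals, and relies on a simple direct-convolution Morlet transform. The names `morlet_cwt` and `low_correlation_windows`, and the parameters `w0`, `window` and `tol`, are hypothetical.

```python
import numpy as np

def morlet_cwt(x, scales, w0=6.0):
    """Continuous wavelet transform with a complex Morlet wavelet (direct convolution)."""
    x = np.asarray(x, dtype=float)
    out = np.empty((len(scales), len(x)), dtype=complex)
    for k, a in enumerate(scales):
        half = int(np.ceil(4 * a))                   # truncate the wavelet at 4 scales
        if 2 * half + 1 > len(x):
            raise ValueError("scale too large for this signal length")
        t = np.arange(-half, half + 1) / a
        psi = np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2) / np.sqrt(a)
        # Correlating with conj(psi) equals convolving with its time reverse.
        out[k] = np.convolve(x, np.conj(psi)[::-1], mode="same")
    return out

def low_correlation_windows(sig_a, sig_b, scales, window=16, tol=1e-9):
    """Start indices of windows whose scalogram blocks are not perfectly correlated."""
    Wa = np.abs(morlet_cwt(sig_a, scales))
    Wb = np.abs(morlet_cwt(sig_b, scales))
    flagged = []
    for start in range(0, Wa.shape[1] - window + 1, window):
        a = Wa[:, start:start + window].ravel()
        b = Wb[:, start:start + window].ravel()
        if np.allclose(a, b):
            continue                                 # identical blocks: correlation is 1
        r = np.corrcoef(a, b)[0, 1]
        if np.isnan(r) or r < 1 - tol:
            flagged.append(start)
    return flagged
```

Windows that are flagged can then be inspected base by base to locate the modified nucleotides; handling insertions and deletions would require the shifting step described in the text, which this sketch omits.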
Fig. 2

Flowchart diagram of the adopted localization methodology to extract similar nucleotide sequences between two genomes.


Recombination analysis

To detect recombination events, the complete genomic sequence of SARS-CoV-2 was investigated together with the other coronavirus sequences. Sequences were first aligned using the Clustal X program and then analyzed with the Simplot program [31]. The default settings were used: window size = 200, step size = 20, replicates used = 100, gap stripping = "on", distance model = "Kimura", tree model = "Neighbor Joining".
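To make the role of the window and step settings concrete, the sketch below computes a sliding-window percent identity profile between two aligned sequences. It is a simplified stand-in and does not reproduce Simplot: it uses plain identity rather than the Kimura distance, performs no bootstrap replicates, and the function name `window_identity` is hypothetical.

```python
def window_identity(aln_query, aln_ref, window=200, step=20):
    """Percent identity of two aligned sequences in sliding windows.

    Columns containing a gap in either sequence are stripped first.
    Returns a list of (window midpoint, percent identity) pairs.
    """
    pairs = [(q, r) for q, r in zip(aln_query.upper(), aln_ref.upper())
             if q != "-" and r != "-"]
    profile = []
    for start in range(0, len(pairs) - window + 1, step):
        chunk = pairs[start:start + window]
        matches = sum(q == r for q, r in chunk)
        profile.append((start + window // 2, 100.0 * matches / window))
    return profile
```

A recombination signal would appear as a region where the reference with the highest identity to the query changes from one genome to another along the profile.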

Results

CGR analysis

The CGR image (2-D) graphical representations are the result of converting the succession of nucleotides in a nucleotide sequence into a visual image. The CGR plots of all investigated sequences are presented in the “Supplementary Fig. 1” file. Fig. 3 shows the correlation values between SARS-CoV-2 and the other species. In general, our results show that SARS-CoV-2 is close to the betacoronavirus genomes (B, V, C, D, S, T, R, Q). More precisely, the highest correlation value (0.9) corresponds to SARS-CoV-2 vs. the Betacoronavirus RaTG13 (B) and Pangolin coronavirus isolate MP789 (V) genomes, followed by bat-SL-CoVZC45 (C) and bat-SL-CoVZXC21 (D). The Yak coronavirus strain YAK/HY24/CH/2017 (U) genome appears to be different (correlation value of 0.55).
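One plausible way to obtain such a correlation value is the Pearson correlation between the flattened frequency CGR matrices of two genomes, sketched below. The text does not specify the exact correlation measure or image resolution, so the measure, the grid size 2^k, the corner assignment and the names `fcgr` and `cgr_correlation` are assumptions.

```python
import numpy as np

# Assumed corner assignment (not stated in the text).
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def fcgr(seq, k=3):
    """Count CGR points falling in each cell of a 2**k x 2**k grid."""
    size = 2 ** k
    grid = np.zeros((size, size))
    x = y = 0.5
    for base in seq.upper():
        if base not in VERTICES:
            continue  # skip ambiguous symbols
        vx, vy = VERTICES[base]
        x, y = 0.5 * (x + vx), 0.5 * (y + vy)
        grid[min(int(y * size), size - 1), min(int(x * size), size - 1)] += 1
    return grid

def cgr_correlation(seq_a, seq_b, k=3):
    """Pearson correlation between the flattened FCGR matrices of two sequences."""
    a, b = fcgr(seq_a, k).ravel(), fcgr(seq_b, k).ravel()
    return float(np.corrcoef(a, b)[0, 1])
```

Because the grid has a fixed size, genomes of different lengths can be compared directly; a value close to 1 indicates similar oligonucleotide composition.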
Fig. 3

CGR image correlation between SARS-CoV-2 and the other genomes: RaTG13 (B), Betacoronavirus CoVZC45 (C), Betacoronavirus CoVZXC21 (D), Alphacoronavirus DQ811787 (E), Gammacoronavirus A116E7 (F), Deltacoronavirus KJ481931 (G), Bat Hp-betacoronavirus (H), Bovine coronavirus Mebus (I), Human coronavirus OC43 (J), Human coronavirus OC43 ATCC (K), Porcine hemagglutinating encephalomyelitis virus VW572 (L), Murine hepatitis virus JHM (M), Rat coronavirus Parker (N), Pipistrellus bat coronavirus HKU5 (O), Rousettus bat coronavirus HKU9 (P), SARS-related Rhinolophus bat coronavirus (Q), SARS-related palm civet coronavirus (R), SARS-related chinese ferret badger coronavirus (S), Tylonycteris bat coronavirus HKU4 (T), Yak coronavirus strain YAK/HY24/CH/2017 (U), and Pangolin coronavirus isolate MP789 (V).

To confirm these results, we calculated the CGR Centroid distances between SARS-CoV-2 and the other investigated sequences. Fig. 4 presents the CGR Centroid points of each sequence. Fig. 4.a shows the plot of the CGR Centroid points of the 4 regions (A, T, C, and G nucleotides) of the investigated sequences; the red point A represents the centroid of the SARS-CoV-2 sequence. Fig. 4.b shows the plot of the CGR Centroid points of the 16 regions (AA, AT, AC, …, GC, and GG dinucleotides) of each sequence.
Fig. 4

CGR centroid plots. Subfigure (a) shows the nucleotide centroid plot of 20 species: SARS-CoV-2 (A), Betacoronavirus RaTG13 (B), Betacoronavirus CoVZC45 (C), Betacoronavirus CoVZXC21 (D), Alphacoronavirus DQ811787 (E), Gammacoronavirus A116E7 (F), Deltacoronavirus KJ481931 (G), Bat Hp-betacoronavirus (H), Bovine coronavirus Mebus (I), Human coronavirus OC43 (J), Human coronavirus OC43 ATCC (K), Porcine hemagglutinating encephalomyelitis virus VW572 (L), Murine hepatitis virus JHM (M), Rat coronavirus Parker (N), Pipistrellus bat coronavirus HKU5 (O), Rousettus bat coronavirus HKU9 (P), SARS-related Rhinolophus bat coronavirus (Q), SARS-related palm civet coronavirus (R), SARS-related chinese ferret badger coronavirus (S), Tylonycteris bat coronavirus HKU4 (T), Yak coronavirus strain YAK/HY24/CH/2017 (U), and Pangolin coronavirus isolate MP789 (V). Subfigure (b) shows the dinucleotide centroid plots of the most similar species.

The degree of correlation is reflected by the variation in distance between the points. Fig. 5 shows that points B (Betacoronavirus RaTG13), V (Pangolin coronavirus isolate MP789), C (Bat Betacoronavirus CoVZC45) and D (Bat coronavirus CoVZXC21) are the most strongly correlated with point A (SARS-CoV-2), and that the distances between points A, B and V are the shortest. This result confirms the similarities obtained by CGR analysis.
Fig. 5

Graphical representation of the distance values between the CGR centroids of each nucleotide region (A, T, C and G) of SARS-CoV-2 (A) and of the other investigated viruses.

The distance values between the CGR centroids of SARS-CoV-2 (A) and of the other investigated virus sequences are presented in Fig. 5. The closer the value is to zero, the greater the similarity between the SARS-CoV-2 genome and the genome it is compared to. It appears that the most similar genomes are B (Betacoronavirus RaTG13) and V (Pangolin coronavirus isolate MP789).

SDFT time frequency analysis

Applying SDFT spectral analysis to the genomic signal makes it possible to detect any latent or hidden periodic signal in the original sequences. Here, the idea is to characterize each genome, independently of its length, with a specific spectrum (1D signature) and spectrogram (2D signature) that indicate the regions of variation, if any, between two or more genomes. Exploring the latent periodicities of whole genomes using the SDFT method can play a key role in detecting homology between these classes of viruses. For further investigation, we visualize the sequences in 1D spectrum and 2D spectrogram representations by applying the SDFT to the FCGS2 signals. These representations reflect the time-frequency signature of each sequence, which may differentiate each genome or indicate the similarities between genomes by highlighting their periodicities. In this work, the Blackman window was chosen as the window type; we take 1024 as the length of the R frames, with a shift index Δr = 512, and for the sub-frames we take N = 256, with Δn = 64. The main effect of windowing is to convert the discontinuities of the frequency response into transition bands between the values on either side of each discontinuity. The spectral representation (1D) of the different genomes and the 2-D spectrum representation are presented in Fig. 6 and Supplementary Fig. 2.
Fig. 6

Superposition of the spectra of the sequences (20 viruses) investigated in this study (a) and of the genome most correlated with the SARS-CoV-2 (A) genome, CoV-RaTG13 (b). The viruses are Betacoronavirus RaTG13 (B), Betacoronavirus CoVZC45 (C), Betacoronavirus CoVZXC21 (D), Alphacoronavirus DQ811787 (E), Gammacoronavirus A116E7 (F), Deltacoronavirus KJ481931 (G), Bat Hp-betacoronavirus (H), Bovine coronavirus Mebus (I), Human coronavirus OC43 (J), Human coronavirus OC43 ATCC (K), Porcine hemagglutinating encephalomyelitis virus VW572 (L), Murine hepatitis virus JHM (M), Rat coronavirus Parker (N), Pipistrellus bat coronavirus HKU5 (O), Rousettus bat coronavirus HKU9 (P), SARS-related Rhinolophus bat coronavirus (Q), SARS-related palm civet coronavirus (R), SARS-related chinese ferret badger coronavirus (S), Tylonycteris bat coronavirus HKU4 (T), Yak coronavirus strain YAK/HY24/CH/2017 (U), and Pangolin coronavirus isolate MP789 (V).

The superposed spectra presented in Fig. 6 reflect the existence of similarities between all investigated sequences and highlight the high correlation between the SARS-CoV-2 genome and the Bat Betacoronavirus RaTG13 (B) and Pangolin coronavirus isolate MP789 (V) genomes, followed by the Betacoronavirus CoVZC45 (C) and Betacoronavirus CoVZXC21 (D) genomes. The spectrum correlation values between SARS-CoV-2 and the Betacoronavirus RaTG13 and Pangolin coronavirus isolate MP789 genomes are about 0.995 and 0.9889, respectively, the highest values among the investigated sequences. On the other hand, the value is about 0.5518 for the Yak coronavirus strain YAK/HY24/CH/2017 genome (Supplementary Fig. 2). For further interpretation, we can use Neighbor-Joining (NJ) clustering, which is an alternative method for hierarchical cluster analysis [32]. Here, we draw phylogenetic trees of SARS-CoV-2 and the other investigated viruses using the calculated spectra correlation values between SARS-CoV-2 and the other viruses (B to T) given in “Supplementary Fig. 3”. The dendrogram in Fig. 7 was produced using the PAST (PAleontological STatistics) program version 3.23 [33]. It shows the degree of homology of the viruses to SARS-CoV-2, generated from the phylogenetic relationships obtained with the spectra of the SDFT method: SARS-CoV-2 (A) is close to the betacoronavirus genomes B, V, C, D, S, T, U, R, and Q, successively. This phylogenetic tree confirms our previous results, which indicate a high correlation between the SARS-CoV-2 genome and the Bat Betacoronavirus RaTG13 (B) and Pangolin coronavirus isolate MP789 (V) genomes, followed by the Betacoronavirus CoVZC45 (C) and Betacoronavirus CoVZXC21 (D) genomes.
Fig. 7

Phylogenetic tree of SARS-COV-2 and the investigated viruses (21 viruses) using the spectra correlation vector and Neighbor-Joining method. The most related sequences to SARS-Cov-2 are highlighted with a red circle. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


CWT time frequency analysis

In this work, we developed a new algorithm to extract the zones of similarity between two sequences using signal processing tools. The scalogram signatures (2-D images) of all investigated sequences are presented in the “Supplementary Fig. 4” file. As an example, Fig. 8 shows a scalogram representation of limited nucleotide regions of two coronavirus genomes: SARS-CoV-2 (nucleotide positions 26,004 to 26,503) and Betacoronavirus RaTG13 (nucleotide positions 25,973 to 26,473). These two scalograms, covering 500 nucleotides, present the time-frequency representation of the sequences. After applying our method, we find 4 nucleotide modifications between the two sequences: one C➔T at position 26,023 bp and three T➔C at positions 26,053 bp, 26,169 bp and 26,332 bp (Fig. 9).
Fig. 9

Distribution of the mutation ratios between the SARS-CoV-2 and betacoronavirus RaTG13 genomes found using our method.

The 2-D representations (scalograms) of the two 500 bp nucleotide sequences (Fig. 8) show the high homology between these sequences. The combination of bioinformatics and signal processing tools applied to the genetic sequences shows that SARS-CoV-2 is highly related to Betacoronavirus RaTG13 and Pangolin coronavirus isolate MP789, with 96% and 86% identity, respectively, along the whole genome. This global result is similar to the result obtained with the BLAST platform. To examine these results in more detail, we analyzed the positions of nucleotide modifications in both viruses in comparison with the SARS-CoV-2 genome. We developed a new algorithm based on the scalogram images resulting from the wavelet transform applied to the genomic signal. Table 2 summarizes the positions of nucleotide modifications relative to the SARS-CoV-2 genome and the percentage of modified nucleotides for each genomic region. The total number of modifications is about 1173 and 2780 nucleotides for Betacoronavirus RaTG13 and Pangolin coronavirus isolate MP789, respectively. For all genes, the RaTG13 sequences are more closely related to SARS-CoV-2 than the Pangolin coronavirus sequences.
Table 2

Nucleotide modifications distribution along the SARS-COV 2 genome comparing to the Cov-RATG13 genome.

| No. | Region (gene or UTR) | Begin | End | Length | Protein | Nucleotide modifications vs. RATG13, number (%) | Nucleotide modifications vs. Pangolin, number (%) |
|---|---|---|---|---|---|---|---|
| 1 | 5′UTR | 16 | 265 | 250 | | 6 (2.4%) | 2 (0.8%) |
| 2 | orf1ab | 266 | 21,555 | 21,290 | GU280_gp01 | 744 (3.494%) | 1924 (9%) |
| 3 | S | 21,563 | 25,384 | 3822 | GU280_gp02 | 271 (7.09%) | 323 (8.45%) |
| 4 | ORF3a | 25,393 | 26,220 | 828 | GU280_gp03 | 31 (3.743%) | 34 (4.106%) |
| 5 | E | 26,245 | 26,472 | 228 | GU280_gp04 | 1 (0.438%) | 2 (0.877%) |
| 6 | M | 26,532 | 27,191 | 660 | GU280_gp05 | 30 (4.545%) | 38 (5.75%) |
| 7 | ORF6 | 27,202 | 27,387 | 186 | GU280_gp06 | 3 (1.612%) | 3 (1.61%) |
| 8 | ORF7a | 27,394 | 27,759 | 366 | GU280_gp07 | 16 (4.371%) | 24 (6.557%) |
| 9 | ORF7b | 27,756 | 27,887 | 132 | GU280_gp08 | 1 (0.757%) | 132 (100%) |
| 10 | ORF8 | 27,894 | 28,259 | 366 | GU280_gp09 | 11 (3.005%) | 229 (62.5%) |
| 11 | N | 28,274 | 29,533 | 1260 | GU280_gp10 | 39 (3.095%) | 48 (3.809%) |
| 12 | ORF10 | 29,558 | 29,674 | 117 | GU280_gp11 | 1 (0.854%) | 1 (0.854%) |
| 13 | 3′UTR | 29,675 | 29,889 | 229 | | 3 (1.39%) | 0 (0%) |
The S gene clearly shows a large number of modifications: 271 (7.09%) for the RATG13 genome and 323 (8.45%) for the Pangolin coronavirus genome. Comparing the scalogram images of the S gene of SARS-CoV-2 with those of the RATG13 and Pangolin coronavirus genomes, we find that the homology changes depending on the selected region of this gene. For this reason, we adapted our analysis algorithm to the regions where the similarities in the alignment are strong (Table 3). Between nucleotide positions 1251 and 1600, the pangolin sequence showed fewer nucleotide modifications than RATG13. These results suggest a possible recombination event in the S gene.
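The per-region counts and percentages reported in Table 2 and Table 3 can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned to the same length and simply counts mismatching positions per region; it is not the scalogram-based algorithm used in this study, and the function name `region_modifications`, the region names and the toy sequences are hypothetical.

```python
from typing import Dict, Tuple


def region_modifications(
    query: str, ref: str, regions: Dict[str, Tuple[int, int]]
) -> Dict[str, Tuple[int, float]]:
    """Count mismatches per region between two pre-aligned sequences.

    `regions` maps a region name to (begin, end), 1-based and inclusive.
    Returns, for each region, (number of modifications, percentage).
    """
    if len(query) != len(ref):
        raise ValueError("sequences must be pre-aligned to the same length")
    result = {}
    for name, (begin, end) in regions.items():
        q = query[begin - 1:end]
        r = ref[begin - 1:end]
        count = sum(1 for a, b in zip(q, r) if a != b)
        result[name] = (count, 100.0 * count / len(q))
    return result


# Toy data (hypothetical, 20 nt): ref_a differs at position 8,
# ref_b differs at positions 13 and 16.
query = "ATGCATGCATGCATGCATGC"
ref_a = "ATGCATGGATGCATGCATGC"
ref_b = "ATGCATGCATGCTTGGATGC"
regions = {"r1": (1, 10), "r2": (11, 20)}

for label, ref in (("ref_a", ref_a), ("ref_b", ref_b)):
    print(label, region_modifications(query, ref, regions))
```

As a check on the arithmetic, 271 modifications over the 3822-nucleotide S gene give 100 × 271 / 3822 ≈ 7.09%, the value listed for RATG13 in Table 2.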
Table 3

Distribution of nucleotide modifications in the S gene of SARS-CoV-2 compared with the S genes of the CoV-RATG13 and Pangolin genomes.

| Position of selected region in S gene | Length | Nucleotide modifications vs. Pangolin, number (%) | Nucleotide modifications vs. RATG13, number (%) |
|---|---|---|---|
| 1–769 | 769 | All (100%) | 38 (4.094%) |
| 770–1250 | 481 | 81 (16.83%) | 16 (3.326%) |
| 1251–1600 | 350 | 50 (14.28%) | 125 (35.7%) |
| 1601–2500 | 900 | 36 (4%) | 30 (3.33%) |
| 2501–2550 | 50 | 4 (8%) | 9 (18%) |
| 2551–3822 | 1272 | 132 (10.3%) | 52 (4.088%) |
Our comparison between SARS-CoV-2 and CoV-RATG13 along the genome showed 1173 nucleotide modifications, distributed as shown in Fig. 9. The minimum mutation ratio is for the nucleotide G becoming C, with a ratio of 1.22%, corresponding to 6 mutations. A ratio of 1.22% is likewise given for the nucleotide T becoming C. The global results are presented in the “Supplementary results” file.
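The distribution of substitution types summarized in Fig. 9 can be sketched as a simple tally. The following Python sketch assumes pre-aligned, equal-length sequences, ignores positions containing symbols other than A, C, G and T, and reads each substitution as reference base becoming query base. This direction convention and the name `substitution_ratios` are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter
from typing import Dict, Tuple

BASES = set("ACGT")


def substitution_ratios(query: str, ref: str) -> Dict[str, Tuple[int, float]]:
    """Tally substitution types (ref base > query base) and their ratios.

    Returns a mapping such as {"T>C": (count, percent_of_all_substitutions)}.
    """
    if len(query) != len(ref):
        raise ValueError("sequences must be pre-aligned to the same length")
    counts = Counter(
        (r, q)
        for q, r in zip(query.upper(), ref.upper())
        if q != r and q in BASES and r in BASES
    )
    total = sum(counts.values())
    return {
        f"{r}>{q}": (n, 100.0 * n / total)
        for (r, q), n in counts.items()
    }


# Toy example (hypothetical sequences): one G>C and three T>C substitutions.
toy_ratios = substitution_ratios("GCTCCCAA", "GGTTTTAA")
for kind, (n, pct) in sorted(toy_ratios.items()):
    print(f"{kind}: {n} ({pct:.1f}%)")
```

Each ratio is the count of one substitution type divided by the total number of substitutions, so the ratios over all twelve possible types sum to 100%.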

Recombination analysis result

To confirm the occurrence of recombination between the RATG13 and pangolin MP789 sequences, we performed a Simplot analysis. Fig. 10 shows evidence of a possible recombination event between nucleotide positions 1250 and 1575 of the S gene.
Fig. 10

Simplot analysis of the SARS-CoV-2 genome in comparison with the RATG13 and Pangolin coronavirus genomes: (a) Simplot analysis of the complete genome and (b) Simplot analysis of the S gene.
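A similarity plot of this kind can be approximated by a sliding-window percent identity. The sketch below is a simplification: SimPlot supports evolutionary distance models and configurable window settings, whereas the hypothetical `sliding_identity` function here only counts matching positions in pre-aligned, equal-length sequences. The helper `crossover_windows` (also hypothetical) lists window centres where the globally more distant reference is locally closer to the query, which is the kind of signal that motivates a recombination hypothesis.

```python
from typing import List, Tuple


def sliding_identity(
    query: str, ref: str, window: int = 200, step: int = 20
) -> List[Tuple[int, float]]:
    """Percent identity in sliding windows over two pre-aligned sequences.

    Returns (window_centre, percent_identity) pairs; centres are 0-based.
    """
    if len(query) != len(ref):
        raise ValueError("sequences must be pre-aligned to the same length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    points = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = ref[start:start + window]
        matches = sum(a == b for a, b in zip(q, r))
        points.append((start + window // 2, 100.0 * matches / window))
    return points


def crossover_windows(
    query: str, ref_major: str, ref_minor: str, window: int, step: int
) -> List[int]:
    """Window centres where ref_minor is more similar to query than ref_major."""
    major = sliding_identity(query, ref_major, window, step)
    minor = sliding_identity(query, ref_minor, window, step)
    return [pos for (pos, ia), (_, ib) in zip(major, minor) if ib > ia]


# Toy data (hypothetical): ref_major diverges only in positions 20..29,
# ref_minor diverges only in positions 0..4.
toy_query = "A" * 40
toy_major = "A" * 20 + "C" * 10 + "A" * 10
toy_minor = "C" * 5 + "A" * 35

print(sliding_identity(toy_query, toy_major, window=10, step=10))
print(crossover_windows(toy_query, toy_major, toy_minor, window=10, step=10))
```

Smaller windows localize a breakpoint more precisely but give noisier identity values, so the window and step are usually tuned to the length of the region being examined.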

Discussion

The World Health Organization was informed on December 31, 2019 of cases of an unexplained respiratory disease, later named COVID-19 and found to be caused by a new coronavirus called SARS-CoV-2. This virus caught the attention of scientists, who set out to analyze the virus, the disease it causes, and how it spreads. Finding the evolutionary relationship between these coronaviruses and characterizing their genomes is a crucial task. It can improve our understanding of the evolution and generation of new viruses and help, in the future, to prevent new emergences. In this paper, we used a combination of different DNA representations and signal processing tools to compare the full genome sequence of SARS-CoV-2 with relevant viral genomes, including the most related ones: Bat, Yak and Pangolin coronaviruses. The image genomic signature provided a rapid identification of the similarity relationships between SARS-CoV-2 and other coronavirus genomes. We used three different representations and developed a new algorithm allowing global and local comparisons of genomic sequences, thus indicating the possible occurrence of recombination events. The results obtained using the different representations confirmed each other. Furthermore, the same findings were obtained when compared with results based on alignment methods (Blast and Simplot).
The methods used are alignment-free, requiring only a sequence database and knowledge of bioinformatics and signal processing [16], [17], [18]. These methods transform every nucleotide sequence into numerical form. The CGR technique was used because it has a remarkable capacity to differentiate between genetic sequences belonging to different species [17]. Unlike alignment-based methods, it uses an image-based approach, with images generated by different techniques, to build genomic signatures.
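For readers unfamiliar with the chaos game representation (CGR), a minimal Python sketch follows. It uses a commonly used corner assignment, A=(0,0), C=(0,1), G=(1,1), T=(1,0), and starts from the centre of the unit square. The corner layout used in this study is not stated here, so this assignment, the function names `cgr_points` and `centroid`, and the choice of a single overall centroid as a numerical summary are assumptions for illustration.

```python
from typing import List, Tuple

# Assumed corner layout of the unit square (a common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Map a nucleotide sequence to CGR coordinates.

    Each point is the midpoint between the previous point and the corner
    of the current nucleotide. Symbols other than A, C, G, T are skipped.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return points


def centroid(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Mean position of a set of CGR points (one simple numerical feature)."""
    if not points:
        raise ValueError("no points to summarize")
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


pts = cgr_points("AG")
print(pts)
print(centroid(pts))
```

Because each step halves the distance to a corner, sequences sharing a suffix of length k land in the same sub-square of side 1/2^k, which is why the resulting image reflects k-mer composition.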
These signatures can be processed by mathematical computations and signal processing tools to extract salient features such as CGR centroid points and time-frequency representations (spectrum, spectrogram, and scalogram). CGR centroid plots are a very efficient way to assess the relationship between DNA sequences in each square of the image. In addition, the spectral representations reflect the frequency distribution of the nucleotides in the genomic sequences, which can indicate the position or zone of greatest change between the genome signals. They also allow construction of the phylogenetic tree of SARS-CoV-2 in comparison with other coronaviruses. Furthermore, we developed a new algorithm that identifies nucleotide modifications between SARS-CoV-2 and its closest genomes without any need for prior alignment. The main advantage of our algorithm is that it generates an output table with local identity percentages and positions, allowing possible recombination events to be suspected, as demonstrated in our study.
This study shows that, based on genomic signature representations, homology between genomes can easily be identified automatically, allowing rapid genomic characterization when time is short, as in this critical period during the outbreak of this novel virus. This makes it possible to rapidly mobilize appropriate methods for diagnosis and medical care specific to the identified viruses. It is also crucial for the choice of appropriate preventive methods and contingency plans. In the future, alignment-free methods can also be advanced and adapted to specific biological tasks and biologists' needs in different fields. Overall, signal processing tools highlight the higher homology of SARS-CoV-2 with other betacoronaviruses than with other genera, and more precisely with the CoV-RATG13 and Pangolin coronavirus isolate MP789 genomes. However, the Yak coronavirus strain YAK/HY24/CH/2017 seems to be different [13].
These results show evolutionary evidence that SARS-CoV-2 derives from Bat and Pangolin coronaviruses as a consequence of the accumulation of mutations and the acquisition of a new genomic region by recombination events. Previous studies have investigated the evolutionary history of the Wuhan COVID-19 virus [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [34], [35]. They found that three bat SARS-like coronaviruses, Betacoronavirus RaTG13, bat-SL-CoVZC45 and bat-SL-CoVZXC21, were closely related to SARS-CoV-2 [35]. Nevertheless, it was not clear, until now, whether or not the COVID-19 virus arose from a recombination event between those viruses [12]. Other scientists suggested a possible genesis of the COVID-19 virus through recombination between Bat [10,11] and Pangolin coronaviruses [14,15]. Our results support this possibility of COVID-19 virus genesis through the evolution of Bat and Pangolin coronaviruses by accumulation of point mutations and recombination. Close contact between humans and wild animals and consumption of bat and pangolin meat can create favorable conditions for the evolution of viruses of both hosts.

Conclusion

This study offers new and rapid ways to automatically identify homology between known and emergent viruses, providing the opportunity for rapid classification and identification of virus origin. Using bioinformatics combined with signal processing tools, we confirmed the high homology of SARS-CoV-2 with the bat BetaCoV-RaTG13 and Pangolin coronavirus isolate MP789 genomes. Thus, our technique can be used to extract numerical features to classify viruses and to perform evolutionary studies of viruses.

Declaration of competing interest

None.