
Topological analysis for sequence variability: Case study on more than 2K SARS-CoV-2 sequences of COVID-19 infected 54 countries in comparison with SARS-CoV-1 and MERS-CoV.

Jnanendra Prasad Sarkar, Indrajit Saha, Arijit Seal, Debasree Maity, Ujjwal Maulik.

Abstract

The pandemic caused by the novel coronavirus SARS-CoV-2 is a serious global concern. More than a thousand new COVID-19 infections are reported daily across the globe. Medical research communities are therefore trying to find a remedy to restrict the spread of the virus, while vaccine development is still under way in parallel. In this critical situation, not only the medical research community but also scientists in fields such as microbiology, pharmacy, bioinformatics and data science are contributing to accelerate vaccine development, virus prediction, and forecasting of the transmission probability and reproduction cases of the virus for social awareness. In this context, we have studied the sequence variability of the virus, focusing primarily on three aspects: (a) sequence variability among SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host, which are in the same coronavirus family, (b) sequence variability of SARS-CoV-2 in human host for 54 different countries and (c) sequence variability between the coronavirus family and country specific SARS-CoV-2 sequences in human host. For this purpose, as a case study, we have performed topological analysis of 2391 global genomic sequences of SARS-CoV-2 in association with SARS-CoV-1 and MERS-CoV using an integrated semi-alignment based computational technique. The results of the semi-alignment based technique are found, experimentally and statistically, to be similar to those of the alignment based technique, while being computationally faster. Moreover, the outcome of this analysis can help to identify nations with homogeneous SARS-CoV-2 sequences, so that the same vaccine can be applied to their heterogeneous human populations.
Copyright © 2021. Published by Elsevier B.V.

Keywords:  COVID-19; Coronavirus; Genomic sequences; Global health; MERS-CoV; SARS-CoV-2; Sequence alignment free technique

Year:  2021        PMID: 33421654      PMCID: PMC7787073          DOI: 10.1016/j.meegid.2021.104708

Source DB:  PubMed          Journal:  Infect Genet Evol        ISSN: 1567-1348            Impact factor:   3.342


Introduction

A worldwide epidemic due to the outbreak of the viral disease COVID-19, caused by a novel virus called Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), was first found in Wuhan City, Hubei Province of China, on 31st December 2019, as informed by the China Country Office of the World Health Organization (WHO) (WHO, 2020; Zhu et al., 2020). The infection and death rates have grown exponentially in subsequent days: more than 43 million infection cases and more than 1 million deaths across the globe were reported on “worldometers.info” as on 25th October 2020. According to the medical research community, SARS-CoV-2 transmits from human to human more rapidly than SARS-CoV-1, while its origin is suspected to be bat or pangolin (Andersen et al., 2020; Zhang and Holmes, 2020; Zhou et al., 2020). In human-to-human transmission, the coronavirus is found to spread over a short distance via droplets when an infected person coughs or sneezes. COVID-19 seems to be a threat to human existence, and to control the spread of the virus, several measures such as complete lockdown have been taken by almost every infected nation. However, as long term lockdown is not a permanent solution and has a deep impact on the global economy (Haleem et al., 2020a), medical research communities across the world are working to find an appropriate vaccine and drug. In this regard, a recent study (Zhu et al., 2020) has found genetic features such as potential etiological agents of SARS-CoV-2 through metagenomic analysis using next-generation sequencing (NGS). A separate study (Wan et al., 2020) has discovered how the spike protein receptor-binding domain (RBD) of SARS-CoV-2 binds with the host receptor angiotensin-converting enzyme 2 (ACE2), which is generally considered the primary regulatory agent for transmission of COVID-19 in both cross-species and human transmission.
Moreover, angiotensin and PPAR family proteins have been found to play a vital role in COVID-19 infection (Dey et al., 2020). In this regard, the goal of our earlier research (Saha et al., 2020a; Saha et al., 2020b) was to elucidate the genetic variability of India specific SARS-CoV-2. However, it is important to study the virus genome sequences of different countries in order to understand the sequence variability quickly using topological analysis, so that nations with homogeneous sequences can be identified and the same vaccine applied to them. This motivated us to study sequence variability in three aspects: (a) sequence similarity among three coronaviruses of the same family, SARS-CoV-1, MERS-CoV and SARS-CoV-2, in human host, (b) SARS-CoV-2 sequence similarity in human host across 54 different countries, to cover various geographical origins, and (c) sequence similarity between the coronavirus family and country specific SARS-CoV-2 sequences in human host. For this purpose, as a case study, we have considered 340, 291 and 2391 global genomic sequences of SARS-CoV-1, MERS-CoV and SARS-CoV-2 respectively. Biological experiments that characterize genetic insights by analyzing genome sequences are typically expensive, and the cost grows almost exponentially with the number of sequences. In this regard, computational techniques play an important role. The most widely used sequence comparison method, proposed in (Smith and Waterman, 1981), uses dynamic programming to compute an optimal local alignment. However, it fails to compare distant sequences effectively. Comparing sequences by iterated pairwise alignment or by multi-sequence alignment is computationally expensive, and optimal multi-sequence alignment can be considered an NP-Complete problem.
Therefore, heuristic methods like ClustalW (Thompson et al., 1994), Clustal Omega (Sievers et al., 2011) and MUSCLE (Edgar, 2004) gained popularity for multi-sequence alignment to extract valuable information from genome sequences. Subsequently, several other alignment based methods like ARCS (Song et al., 2006), profile Hidden Markov Model (pHMM) (Eddy, 1998) and PFam (Punta et al., 2012) were also proposed. However, the major challenge in almost all alignment based methods is the requirement of a correct alignment of multiple sequences. Moreover, computing an optimal multi-sequence alignment is an NP-Hard problem, and it is difficult to compute the score correctly for more than two nucleotides. Therefore, less computationally expensive alignment free techniques like k-mer (Manekar and Sathe, 2018; Solis-Reyes et al., 2018) gained popularity. The k-mer technique is an essential part of many bioinformatics processes such as genome and transcriptome assembly, metagenomic sequencing and error correction of sequence reads (Melsted and Pritchard, 2011), and its importance and superiority over other techniques like REGA (Pineda-Pena et al., 2013), SCUEAL (Pond et al., 2009) and COMET (Struck et al., 2014) have been explained in (Solis-Reyes et al., 2018). Hence, in this article we have proposed a semi-alignment based technique using k-mer for sequence analysis. Additionally, t-distributed Stochastic Neighbor Embedding (tSNE) (Hinton and Roweis, 2002) is applied to the count vectors generated through the k-mer and n-gram techniques for visualization purposes.
After the topological study of global sequence variability, we report analytical findings on (a) genome sequence variability among SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host, (b) SARS-CoV-2 sequence variability in human host across different global geographical locations, to understand the probable relation, and (c) sequence variability between the coronavirus family and country specific SARS-CoV-2 sequences in human host. This information can be considered when applying the same vaccine to countries with similar genome sequences.

Research objectives

While research communities across the world are working on biological and technological advancements (Bahl et al., 2020; Haleem et al., 2020b; Javaid et al., 2020; Singh et al., 2020a; Singh et al., 2020b; Singh et al., 2020c; Vaishya et al., 2020a; Vaishya et al., 2020b) to fight the challenging pandemic situation of COVID-19, the prime objective of this research is to study, using a semi-alignment based technique, the topological genome sequence variability (a) among three intra-family coronaviruses, viz. SARS-CoV-1, MERS-CoV and SARS-CoV-2, (b) of SARS-CoV-2 in human host for countries in diverse geographical locations and (c) between the coronavirus family and country specific SARS-CoV-2 sequences in human host. With these analytical findings, the study aims to help medical communities find nations with homogeneous sequences for effective application of vaccine and drug. As an additional important objective, the experimental study also shows the similarity between the results produced by the computationally expensive sequence alignment based technique and the computationally less expensive semi-alignment based technique, in order to speed up research on designing a prophylactic vaccine and therapeutic drug for SARS-CoV-2.

Materials and method

To carry out the experiments and analysis for the research objectives stated earlier, we first collected the relevant genome sequences, performed the necessary data pre-processing, and then applied the semi-alignment based technique to the pre-processed data. Fig. 1 depicts the flow of the entire experiment, and the following subsections describe the various processes in detail.
Fig. 1

Steps of the workflow.


Data preparation

For the experiment, we have downloaded genome sequences of SARS-CoV-1 and MERS-CoV from the National Center for Biotechnology Information (NCBI), whereas genome sequences of SARS-CoV-2 are collected from the Global Initiative on Sharing All Influenza Data (GISAID), in fasta format. To perform the experiment, it is important to consider complete or near complete genome sequences of the virus. Therefore, basic data pre-processing is applied to retain only those sequences of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host that have an average length of more than 29.5 kbp, to avoid any incomplete sequence. The statistics of the refined datasets of coronaviruses in human host are reported in Table 1, while the country wise statistics and geoplot of the same SARS-CoV-2 sequences are reported in Table 2 and Fig. 2.
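The length-based filtering described above can be sketched as follows. This is only an illustrative sketch: it assumes the threshold is applied to each sequence individually, and the names `read_fasta`, `filter_complete` and `MIN_LENGTH` are hypothetical and do not come from the authors' software.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical cut-off in bases, taken from the "29.5 kbp" figure in the text.
MIN_LENGTH = 29_500


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def filter_complete(records: Iterable[Tuple[str, str]],
                    min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Keep only records whose sequence has at least min_length bases."""
    return {header: seq for header, seq in records if len(seq) >= min_length}


# Toy input with a small threshold so the example stays readable.
toy = [">seq_a", "ACGTACGT", "ACGT", ">seq_b", "ACG"]
kept = filter_complete(read_fasta(toy), min_length=8)
print(sorted(kept))
```

With real data, the lines of a downloaded fasta file would be passed to `read_fasta` and the default threshold used instead of the toy value.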
Table 1

Statistics of the refined genome sequences of coronaviruses in human host.

Virus name | Source of sequence | No. of sequences | Max length of sequence | Avg length of sequence
SARS-CoV-1 | NCBI | 340 | 30,311 | 29,514
MERS-CoV | NCBI | 291 | 30,150 | 29,983
SARS-CoV-2 | GISAID | 2391 | 29,986 | 29,512
Table 2

Statistics of the country wise refined genome sequences of SARS-CoV-2 in human host.

Country | No. of sequences | Country | No. of sequences | Country | No. of sequences
USA | 588 | Canada | 17 | Hungary | 2
Iceland | 343 | Italy | 17 | Thailand | 2
China | 321 | Singapore | 14 | Cambodia | 1
Netherlands | 190 | Finland | 13 | Colombia | 1
England | 160 | South Korea | 13 | Ecuador | 1
Wales | 107 | Georgia | 10 | Lithuania | 1
Japan | 83 | Luxembourg | 10 | Mexico | 1
France | 75 | Denmark | 9 | Nepal | 1
Australia | 64 | Malaysia | 8 | Nigeria | 1
Belgium | 45 | New Zealand | 8 | Northern Ireland | 1
Portugal | 44 | Norway | 8 | Pakistan | 1
India | 35 | Chile | 7 | Panama | 1
Brazil | 34 | Ireland | 6 | Peru | 1
Switzerland | 31 | Vietnam | 6 | Poland | 1
Germany | 27 | Kuwait | 4 | Russia | 1
Spain | 27 | Slovakia | 4 | South Africa | 1
Congo | 19 | Czech Republic | 3 | Sweden | 1
Scotland | 18 | Saudi Arabia | 3 | Turkey | 1
Fig. 2

Geoplot of SARS-CoV-2 sequences in human host for 54 countries.


Semi-alignment based technique

To perform the analysis, we have integrated the sequence alignment free technique k-mer (Manekar and Sathe, 2018; Solis-Reyes et al., 2018) with count vectorization using n-grams and tSNE. k-mer is used to create a set of descriptors from the complete genome sequence of a virus. Subsequently, n-gram is used to build features by selecting n descriptors at a time. Each such n-gram feature set is called a Bag-of-Descriptors (BoD). From these BoDs, count vectorization is used to create a numeric feature vector by counting the frequency of each n-gram feature. Such k-mer generated descriptors and the top 10 n-gram BoDs are shown in Fig. 3. The numeric feature vector is then used to perform tSNE, with prior information about the cluster of each virus sequence, in order to reduce the dimension to two. Thereafter, the reduced numerical dataset of virus sequences and the prior cluster information are used to find the medoid sequence of each cluster in Euclidean space. The clusters are of two broad types, where (a) $E = \{E_i \mid i = 1, \dots, 3\}$ is the set of medoid reference sequences of the three virus wise clusters of SARS-CoV-1, MERS-CoV and SARS-CoV-2 for human host and (b) $E = \{E_i \mid i = 1, \dots, 54\}$ is the set of medoid reference sequences of the 54 country wise clusters for human host. Class type (a) covers all three virus sequences, while class type (b) covers SARS-CoV-2 in human host only. Here, the Euclidean distance based medoid of each cluster represents the reference genome sequence for its population. During the analysis, the similarities among reference sequences are computed using pairwise alignment, which is feasible because the set of reference sequences is much smaller than the total number of sequences. Thus, our technique is semi-alignment based, while the whole experiment is called a topological analysis. This technique covers all sequences in less computational time.
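The descriptor and count-vector steps described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: descriptors are taken as overlapping k-mers with a stride of one, n-grams are runs of n consecutive descriptors, and the function names (`kmer_descriptors`, `ngram_features`, `count_vectors`) are hypothetical rather than taken from the authors' software.

```python
from collections import Counter
from typing import List, Tuple


def kmer_descriptors(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def ngram_features(descriptors: List[str], n: int = 4) -> List[str]:
    """Join every run of n consecutive descriptors into one n-gram feature."""
    return [" ".join(descriptors[i:i + n])
            for i in range(len(descriptors) - n + 1)]


def count_vectors(sequences: List[str], k: int = 3,
                  n: int = 4) -> Tuple[List[str], List[List[int]]]:
    """Return the shared n-gram vocabulary and one count vector per sequence."""
    counts = [Counter(ngram_features(kmer_descriptors(s, k), n))
              for s in sequences]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[feature] for feature in vocabulary] for c in counts]
    return vocabulary, matrix


# Two toy sequences that differ only in the last base.
vocabulary, matrix = count_vectors(["ACGTACGT", "ACGTACGA"])
for feature, column in zip(vocabulary, zip(*matrix)):
    print(feature, column)
```

The resulting count matrix is the kind of numeric input that a dimensionality reduction step such as tSNE would then embed in two dimensions.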
Fig. 3

Word cloud of k-mer (k = 3) generated descriptors for (a) SARS-CoV-1 (c) MERS-CoV and (e) SARS-CoV-2 sequences and top 10 n-grams (n = 4) of k-mer generated descriptors generated for (b) SARS-CoV-1 (d) MERS-CoV and (f) SARS-CoV-2 sequences in human host.


Results and discussion

Each cluster is an embedded representation of virus sequences as 2D data points generated by tSNE from the k-mer generated descriptors and n-gram features, where k = 3 and n = 4 are fixed experimentally. The embedded clusters of SARS-CoV-1, MERS-CoV and SARS-CoV-2 sequences, and of SARS-CoV-2 sequences in human host for the top 20 countries, are shown in Figs. 4(a) and (b). Each cluster is homogeneous in nature, which means that the sequences represented by 2D data points within the same cluster have maximum similarity. On the other hand, sequences belonging to two different clusters are less similar. From these cluster data, the reference sequence sets $E = \{E_i \mid i = 1, \dots, 3\}$ (virus wise) and $E = \{E_i \mid i = 1, \dots, 54\}$ (country wise) are computed as described in the method section. These reference sequences are used for the detailed experimental analysis described in the following subsections. Note that, as a comparative study, the results of the alignment based technique are also compared with the results of our semi-alignment based technique.
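Selecting a medoid from the embedded 2D points of one cluster can be sketched as below. This is a sketch under the usual definition of a medoid, the cluster member with the smallest total Euclidean distance to all other members; the function name `medoid_index` and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence


def medoid_index(points: Sequence[Sequence[float]]) -> int:
    """Index of the point with the smallest total Euclidean distance to all others."""
    totals = [sum(math.dist(p, q) for q in points) for p in points]
    return min(range(len(points)), key=totals.__getitem__)


# Toy 2D embedding of one cluster; the last point is an outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(medoid_index(cluster))
```

Unlike a centroid, the medoid is always an actual member of the cluster, which is what allows it to stand for a real genome sequence.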
Fig. 4

Embedded representation of (a) SARS-CoV-1, MERS-CoV and SARS-CoV-2 sequences and (b) SARS-CoV-2 sequences in human host for top 20 countries.


Coronavirus family specific analysis for human host

E is the set of three reference sequences of the virus clusters SARS-CoV-1, MERS-CoV and SARS-CoV-2, where the medoid of each cluster is considered as its reference sequence. Table 3 and Fig. 5(a) show the medoid sequence similarities among SARS-CoV-1, MERS-CoV and SARS-CoV-2 computed using the semi-alignment based technique. It is observed that the sequence of SARS-CoV-2 is approximately 76.59% similar to SARS-CoV-1 at the nucleotide level, and it is largely dissimilar to the sequence of MERS-CoV, with only ≈ 36.09% similarity. The studies in (Lan et al., 2020; Wrapp et al., 2020) report that both SARS-CoV-1 and SARS-CoV-2 are suited to bind with the human ACE2 receptor. Hence the approximately 76.59% similarity of SARS-CoV-2 with SARS-CoV-1 in our experiment is consistent with these studies and supports our approach in terms of medoid sequence similarity measures. In this regard, the ≈ 23.41% dissimilarity between SARS-CoV-2 and SARS-CoV-1 may be the cause of the wide spread of SARS-CoV-2. Additionally, Table 4 and Fig. 5(b) show the corresponding results and heatmap generated by the alignment based technique. The comparison suggests that the results of the two techniques are equivalent.
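One common way to turn a pairwise alignment of two reference sequences into a similarity percentage is percent identity over a global alignment. The sketch below is illustrative only: the text does not state the exact similarity formula or scoring scheme used, so the Needleman-Wunsch scores (`match`, `mismatch`, `gap`) and the choice of alignment length as the denominator are assumptions.

```python
from typing import Tuple


def global_align(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -1) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned positions as a percentage of the alignment length."""
    x, y = global_align(a, b)
    if not x:
        return 100.0  # two empty sequences are treated as identical
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)


print(percent_identity("ACGT", "ACGA"))
```

This quadratic-time routine is only practical for a handful of sequences, which is the situation the semi-alignment approach creates by restricting alignment to the medoid reference sequences.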
Table 3

Similarity measure among virus reference sequences as medoid (E) of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host using semi-alignment based technique.

Virus in human host | SARS-CoV-1 | MERS-CoV | SARS-CoV-2
SARS-CoV-1 | 100.00 | 35.83 | 76.59
MERS-CoV | 35.83 | 100.00 | 36.09
SARS-CoV-2 | 76.59 | 36.09 | 100.00
Fig. 5

Heatmap of results produced by (a) semi-alignment based technique and (b) alignment based technique, for measuring similarity among reference sequences as medoid (E) of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host.

Table 4

Similarity measure among virus reference sequences as medoid (E) of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host using alignment based technique.

Virus in human host | SARS-CoV-1 | MERS-CoV | SARS-CoV-2
SARS-CoV-1 | 100.00 | 35.75 | 76.77
MERS-CoV | 35.75 | 100.00 | 36.09
SARS-CoV-2 | 76.77 | 36.09 | 100.00

Country specific SARS-CoV-2 analysis for human host

In this analysis, E, the set of SARS-CoV-2 reference sequences in human host for 54 countries, is analyzed to understand the topological sequence variability of all nations against all others. An embedded representation of the country wise SARS-CoV-2 sequence distribution in human host, taking the medoid data point as the reference sequence of each country's sequence population, is shown in Fig. 6. Subsequently, a heatmap is generated using the semi-alignment based technique and shown in Fig. 7(a), where it is observed that Brazil has a different sequence variant compared to other countries. For further analysis, a bar plot measuring the similarity among the country wise SARS-CoV-2 reference sequences as medoid (E) is presented in Fig. 7(b). Here, each country's bar represents the aggregated similarity (scaled to the range [0,1]) of all other countries with respect to that country. For example, the bar value of Australia indicates that the Australian sequence population is ≈ 90% similar to those of other countries. From Fig. 7(b), it is evident that almost all nations have high inter-country similarity in their SARS-CoV-2 reference sequences, except Brazil, Colombia, Ecuador, Ireland and Wales, which have comparatively lower inter-country similarity. Moreover, Figs. 8(a) and (b) are generated from the results of the alignment based technique for comparison with Figs. 7(a) and (b), which are produced by the semi-alignment based technique. Both sets of figures suggest that the results produced by the semi-alignment based technique are almost equivalent to those produced by the alignment based technique.
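The aggregated bar value described above can be sketched as follows. The text does not spell out the aggregation or the scaling, so this sketch assumes the mean of the off-diagonal entries of each row of the similarity matrix, divided by 100 to land in [0,1]; the function name `aggregated_similarity`, the labels and the toy matrix are hypothetical.

```python
from typing import List


def aggregated_similarity(matrix: List[List[float]]) -> List[float]:
    """Mean similarity (in percent) of each row to all other rows, scaled to [0, 1]."""
    n = len(matrix)
    return [sum(row[j] for j in range(n) if j != i) / (n - 1) / 100.0
            for i, row in enumerate(matrix)]


# Hypothetical symmetric similarity matrix (percent) for three populations A, B, C.
labels = ["A", "B", "C"]
similarity = [
    [100.0, 90.0, 80.0],
    [90.0, 100.0, 70.0],
    [80.0, 70.0, 100.0],
]
for label, value in zip(labels, aggregated_similarity(similarity)):
    print(label, round(value, 2))
```

A population whose bar is clearly lower than the rest, as in row C of the toy matrix, is one whose reference sequence differs most from the others.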
Fig. 6

Embedded representation of country wise reference sequences as medoid (E) of SARS-CoV-2 in human host.

Fig. 7

(a) Heatmap (b) Bar plot representation of aggregated results produced by semi-alignment based technique for measuring similarity among country wise SARS-CoV-2 reference sequences as medoid (E) in human host.

Fig. 8

(a) Heatmap (b) Bar plot representation of aggregated results produced by alignment based technique for measuring similarity among country wise SARS-CoV-2 reference sequences as medoid (E) in human host.


Coronavirus family and country specific SARS-CoV-2 analysis for human host

Additionally, the country wise set E of SARS-CoV-2 reference sequences in human host for 54 countries is analyzed against the virus wise set E of medoid reference sequences of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host. For this purpose, the similarity measures computed between the virus wise set $E = \{E_i \mid i = 1, \dots, 3\}$ and the country wise set $E = \{E_i \mid i = 1, \dots, 54\}$ using the semi-alignment based and alignment based techniques are reported in Table 5 and Table 6 respectively. A visual representation of Table 5 is also shown as a circos plot in Fig. 9. Interestingly, in both Table 5 and Table 6, the SARS-CoV-2 reference sequence of Brazil is found to differ somewhat from the reference sequence of the SARS-CoV-2 family: in Table 5 it is ≈ 88.46% similar to the SARS-CoV-2 reference and ≈ 65.40% similar to the SARS-CoV-1 reference. For the rest of the nations, the sequences of most are very close to the reference sequence of the SARS-CoV-2 family. For example, Australia, Belgium, Congo, India, Italy and USA have ≈ 99.94%, ≈ 99.85%, ≈ 98.98%, ≈ 99.75%, ≈ 99.83% and ≈ 98.97% similarity respectively.
Table 5

Similarity measure between virus reference sequences as medoid (E) of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host with country wise SARS-CoV-2 reference sequences as medoid (E) in human host using semi-alignment based technique.

Country | SARS-CoV-1 | MERS-CoV | SARS-CoV-2 | Country | SARS-CoV-1 | MERS-CoV | SARS-CoV-2
Australia | 76.55 | 36.11 | 99.94 | Mexico | 76.62 | 35.93 | 99.57
Belgium | 76.69 | 36.21 | 99.85 | Nepal | 76.66 | 36.11 | 99.80
Brazil | 65.40 | 32.53 | 88.46 | Netherlands | 76.57 | 36.16 | 99.88
Cambodia | 76.71 | 36.11 | 99.69 | New Zealand | 76.17 | 36.05 | 99.56
Canada | 75.70 | 35.26 | 98.92 | Nigeria | 75.53 | 36.17 | 98.67
Chile | 76.32 | 35.99 | 99.76 | Northern Ireland | 74.57 | 36.26 | 98.21
China | 76.65 | 36.08 | 99.88 | Norway | 76.75 | 36.12 | 99.74
Colombia | 73.13 | 35.98 | 97.01 | Pakistan | 76.71 | 36.16 | 99.80
Congo | 75.68 | 35.99 | 98.98 | Panama | 76.29 | 35.11 | 99.21
Czech Republic | 76.71 | 36.13 | 99.69 | Peru | 76.75 | 36.14 | 99.72
Denmark | 76.75 | 36.23 | 99.81 | Poland | 76.64 | 35.84 | 99.58
Ecuador | 72.66 | 31.33 | 96.04 | Portugal | 76.44 | 36.25 | 99.78
England | 75.87 | 35.57 | 98.82 | Russia | 76.18 | 36.10 | 99.40
Finland | 76.53 | 36.25 | 97.58 | Saudi Arabia | 76.62 | 35.90 | 99.58
France | 76.53 | 36.13 | 99.63 | Scotland | 74.49 | 36.36 | 98.07
Georgia | 76.65 | 36.10 | 99.80 | Singapore | 76.68 | 36.02 | 99.79
Germany | 76.57 | 36.08 | 99.96 | Slovakia | 76.63 | 35.93 | 99.58
Hungary | 76.56 | 36.13 | 99.94 | South Africa | 76.63 | 35.93 | 99.57
Iceland | 76.68 | 36.11 | 99.67 | South Korea | 76.44 | 36.08 | 99.68
India | 76.72 | 36.17 | 99.75 | Spain | 75.90 | 36.18 | 99.00
Ireland | 73.50 | 35.21 | 97.05 | Sweden | 76.67 | 35.92 | 99.63
Italy | 76.45 | 36.27 | 99.83 | Switzerland | 76.67 | 36.01 | 99.61
Japan | 76.77 | 36.18 | 99.83 | Thailand | 76.78 | 36.17 | 99.78
Kuwait | 76.67 | 36.00 | 99.60 | Turkey | 76.45 | 36.03 | 99.35
Lithuania | 76.75 | 36.15 | 99.72 | USA | 75.59 | 35.86 | 98.97
Luxembourg | 76.76 | 36.09 | 99.77 | Vietnam | 76.68 | 35.97 | 99.62
Malaysia | 76.64 | 35.98 | 99.62 | Wales | 74.01 | 35.94 | 97.33
Table 6

Similarity measure between virus reference sequences as medoid (E) of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host with country wise SARS-CoV-2 reference sequences as medoid (E) in human host using alignment based technique.

Country | SARS-CoV-1 | MERS-CoV | SARS-CoV-2 | Country | SARS-CoV-1 | MERS-CoV | SARS-CoV-2
Australia | 76.38 | 35.87 | 99.67 | Mexico | 76.76 | 35.82 | 99.74
Belgium | 76.68 | 36.11 | 99.91 | Nepal | 76.63 | 36.01 | 99.81
Brazil | 66.04 | 34.59 | 88.47 | Netherlands | 76.54 | 36.13 | 99.80
Cambodia | 76.87 | 36.01 | 99.86 | New Zealand | 76.53 | 36.03 | 99.80
Canada | 76.09 | 35.89 | 99.06 | Nigeria | 75.51 | 36.12 | 98.56
Chile | 76.56 | 36.14 | 99.82 | Northern Ireland | 74.55 | 36.34 | 98.07
China | 76.78 | 36.07 | 99.96 | Norway | 76.87 | 36.11 | 99.88
Colombia | 73.11 | 35.91 | 96.86 | Pakistan | 76.77 | 36.07 | 99.93
Congo | 76.53 | 36.10 | 99.71 | Panama | 76.41 | 34.99 | 99.37
Czech Republic | 76.81 | 36.03 | 99.90 | Peru | 76.81 | 36.04 | 99.89
Denmark | 76.77 | 36.08 | 99.95 | Poland | 76.77 | 35.73 | 99.75
Ecuador | 72.66 | 31.29 | 95.91 | Portugal | 76.46 | 36.15 | 99.74
England | 76.60 | 36.17 | 99.85 | Russia | 76.26 | 36.00 | 99.41
Finland | 76.53 | 36.17 | 99.77 | Saudi Arabia | 76.77 | 35.80 | 99.75
France | 76.78 | 36.05 | 99.91 | Scotland | 74.48 | 36.22 | 97.92
Georgia | 76.72 | 36.05 | 36.40 | Singapore | 76.77 | 36.01 | 99.83
Germany | 76.53 | 36.10 | 99.80 | Slovakia | 76.76 | 35.89 | 99.75
Hungary | 76.52 | 36.05 | 99.78 | South Africa | 76.77 | 35.82 | 99.74
Iceland | 76.76 | 36.09 | 99.97 | South Korea | 76.48 | 35.93 | 99.69
India | 76.82 | 36.02 | 99.89 | Spain | 76.53 | 36.03 | 99.79
Ireland | 73.48 | 35.09 | 96.90 | Sweden | 76.81 | 35.81 | 99.79
Italy | 76.56 | 36.15 | 99.79 | Switzerland | 76.75 | 35.90 | 99.75
Japan | 76.76 | 36.08 | 99.99 | Thailand | 76.80 | 36.06 | 99.91
Kuwait | 76.75 | 35.89 | 99.77 | Turkey | 76.58 | 35.90 | 99.52
Lithuania | 76.84 | 36.04 | 99.89 | USA | 76.85 | 35.81 | 99.83
Luxembourg | 76.86 | 35.82 | 99.83 | Vietnam | 76.79 | 35.86 | 99.78
Malaysia | 76.85 | 36.06 | 99.89 | Wales | 76.55 | 35.95 | 99.80
Fig. 9

Circos plot of similarity measure between country wise SARS-CoV-2 reference sequences as medoid (E) in human host with virus reference sequences as medoid (E) of SARS-CoV-1, MERS and SARS-CoV-2 in human host using semi-alignment based technique.


Comparative analysis between semi-alignment based technique and alignment based technique

While the entire analysis is reported with the results generated using the semi-alignment based technique, we have also performed the experiment using the alignment based technique to assess the equivalence of the outcomes of the two techniques. This also helps to establish that the reference sequences identified using the semi-alignment based and alignment based techniques are equivalent. For this purpose, we have performed the two-sample Kolmogorov-Smirnov (KS) test (Massey, 1951). The KS test is performed on the null hypothesis that the sequence similarity results produced by the semi-alignment based and alignment based techniques are the same, at the 5% significance level. Thus, the null hypothesis is not rejected when the p-value is greater than 0.05. The mean p-values of the test results are reported in Fig. 10, while the detailed results produced by both techniques are given in the supplementary material. From Fig. 10 bar (A), it is evident that the p-value for the sequence similarities produced by the semi-alignment based technique (Table 3) and the alignment based technique (Table 4) for intra-virus classes among SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host is 0.97, which is much greater than 0.05 and signifies that the results produced by the two techniques are equivalent. Similarly, the p-values of the KS test between Fig. 7(a) and Fig. 8(a), and between Table 5 and Table 6, generated by the semi-alignment based and alignment based techniques, are 0.40 and 0.97 respectively, which indicates that the results generated by the two techniques are similar.
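The two-sample KS comparison described above can be sketched without external libraries. This is a hedged sketch: it uses the standard asymptotic Kolmogorov approximation for the p-value (with a small-sample correction to the effective sample size), which is only approximate for small samples and may differ from the exact procedure the authors ran. The function names are hypothetical, and feeding the off-diagonal values of Table 3 and Table 4 as the two samples is an assumption about how the test was set up.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Largest gap between the empirical distribution functions of x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue_asymptotic(d: float, n: int, m: int, terms: int = 100) -> float:
    """Approximate two-sided p-value from the Kolmogorov limiting distribution."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        return 1.0  # the series is numerically 1 here and converges poorly
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Off-diagonal similarities from Table 3 (semi-alignment) and Table 4 (alignment).
semi_alignment = [35.83, 76.59, 36.09]
alignment = [35.75, 76.77, 36.09]
d_stat = ks_statistic(semi_alignment, alignment)
p_value = ks_pvalue_asymptotic(d_stat, len(semi_alignment), len(alignment))
print(d_stat, p_value, "not rejected" if p_value > 0.05 else "rejected")
```

A p-value above 0.05 means the test finds no evidence that the two sets of similarity values come from different distributions; it does not by itself prove that they are identical.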
Fig. 10

Mean p-value of two-sample Kolmogorov-Smirnov test between genome sequence similarity results produced by semi-alignment based technique and alignment based technique for (A) virus reference sequences as medoid (E) in human host (B) country wise SARS-CoV-2 reference sequences as medoid (E) in human host (C) virus reference sequences as medoid (E) of SARS-CoV-1, MERS-CoV and SARS-CoV-2 in human host with country wise SARS-CoV-2 reference sequences as medoid (E) in human host.

Moreover, computing sequence similarity after performing sequence alignment using Clustal Omega takes around 2 days on a machine with an Intel Core i5-2410M CPU at 2.30 GHz and 8 GB RAM, whereas the complete analysis using the semi-alignment based technique takes less than an hour on the same machine. Thus, the semi-alignment based technique is found to be computationally economical and effective for obtaining results quickly.

Summary of outcome

After performing topological experiments for (a) coronavirus family specific analysis for human host, (b) country specific SARS-CoV-2 analysis for human host and (c) coronavirus family and country specific SARS-CoV-2 analysis for human host, we can summarize the findings of our case study and the possible suggestions as follows. The SARS-CoV-2 sequence in human host is found to be approximately 76.59% similar to the SARS-CoV-1 sequence and largely dissimilar to MERS-CoV in human host. Moreover, the approximately 23.41% dissimilarity between the SARS-CoV-2 and SARS-CoV-1 sequences might be the reason for the wide spreading capability of SARS-CoV-2. When analyzing the inter-country sequences of SARS-CoV-2, certain nations are found to have very high sequence similarities. Therefore, nations with similar sequences might consider using a similar vaccine or drug. Certain nations, like Brazil, Ecuador and Ireland, should be given more attention for detailed genetic analysis, as they have slightly different sequences than other nations. The SARS-CoV-2 sequences of most nations are approximately 99% similar to the reference sequence of the SARS-CoV-2 family, and approximately 76% and 36% similar to SARS-CoV-1 and MERS-CoV respectively within the coronavirus family. However, Brazil is found to be an exception: its SARS-CoV-2 sequence is approximately 88.46% similar to the reference sequence of the SARS-CoV-2 family and approximately 65.40% similar to that of the SARS-CoV-1 family. We have used both the computationally expensive sequence alignment based technique and the less computationally expensive semi-alignment based technique for sequence variability analysis. It is experimentally found that the outcomes of the two techniques are close to each other. Therefore, the semi-alignment based technique can be used to speed up research on SARS-CoV-2 in this critical situation for designing a prophylactic vaccine and therapeutic drug (Dey et al., 2017; Nandy and Basak, 2016).

Conclusions

In the world health context, the COVID-19 disease caused by SARS-CoV-2 has created a pandemic in the human population with a large number of infections. Person-to-person transmission is found to play a crucial role in the spread of COVID-19. Various regulating measures, such as lockdown, have been taken worldwide to restrict transmission, and research is being conducted across the globe to find an appropriate vaccine and drug. In this article, we have therefore performed a case study on genome sequence variability for global health using a semi-alignment based technique. We have also conducted experiments using the sequence alignment based technique, and found that the less computationally expensive semi-alignment based technique produces results equivalent to those of the expensive alignment based technique. We have focused primarily on three aspects: (a) coronavirus family specific sequence variability for human host, (b) country specific sequence variability of SARS-CoV-2 of 54 nations for human host and (c) coronavirus family and country specific SARS-CoV-2 sequence variability for human host. To perform the analysis, we have computed a medoid reference sequence for each of the categories. It has been observed that the reference sequences across different countries are quite similar, with certain exceptional cases like Brazil and Ireland. For these exceptional cases, our future research will focus on a detailed genomic study of SARS-CoV-2 to understand the genome-wide variability for global health. Moreover, our future research will also explore the effective application of a purely alignment free technique, in order to overcome the limitation of the current article, where we have applied a semi-alignment based technique.

Ethics approval and consent to participate

Ethical approval and individual consent were not applicable.

Availability of data and materials

The datasets of SARS-CoV-1, MERS-CoV and SARS-CoV-2 sequences used in our case study, the reference sequences identified as medoids in the virus-wise and country-wise analyses, and the software are available at "http://www.nitttrkol.ac.in/indrajit/projects/COVID-TopologicalAnalysis-SequenceVariability/". Moreover, all the virus sequences used in this work are publicly available in the NCBI and GISAID databases.

Consent for publication

Not applicable.

Funding

This work has been partially supported by a CRG short term research grant on COVID-19 (CVD/2020/000991) from the Science and Engineering Research Board (SERB), Department of Science and Technology, Govt. of India.

CRediT authorship contribution statement

Jnanendra Prasad Sarkar: Conceptualization, Formal analysis, Validation, Visualization, Writing - original draft. Indrajit Saha: Conceptualization, Data curation, Supervision, Funding acquisition, Software, Formal analysis, Investigation, Methodology, Web development, Project administration, Resources, Validation, Visualization, Writing - review & editing. Arijit Seal: Conceptualization, Formal analysis, Writing - review & editing. Debasree Maity: Conceptualization, Data curation, Methodology, Writing - review & editing. Ujjwal Maulik: Conceptualization, Methodology, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no conflict of interest.