
Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon Entropy and K-means clustering.

Baishali Mullick, Rishikesh Magar, Aastha Jhunjhunwala, Amir Barati Farimani.

Abstract

The SARS-CoV-2 virus, like many other viruses, has continually transformed to give rise to new variants through mutations, commonly substitutions and indels. In some cases these mutations give the virus a survival advantage, making the mutants dangerous. In general, laboratory investigation must be carried out to determine whether new variants have characteristics that make them more lethal or contagious, so complex and time-consuming analyses are required to establish the exact impact of a particular mutation. The time required for these analyses makes it difficult to understand the variants of concern, thereby limiting the preventive action that can be taken against their rapid spread. In this analysis, we deploy a statistical technique, Shannon entropy, to identify the positions in the SARS-CoV-2 spike protein sequence that are most susceptible to mutation. Subsequently, we use machine learning based clustering techniques to cluster known dangerous mutations based on similarities in their properties. This work utilizes embeddings generated by a language model, ProtBERT, to identify mutations of a similar nature and to pick out regions of interest based on proneness to change. Our entropy-based analysis identified fifteen hotspot regions, and in six of them we were able to validate ten known variants of interest. As the SARS-CoV-2 situation rapidly evolves, we believe that the remaining nine mutational hotspots may contain variants that emerge in the future. We believe that this approach may help the research community devise therapeutics based on probable new mutation zones in the viral sequence and on the resemblance in properties of various mutations.
Copyright © 2021 The Authors. Published by Elsevier Ltd. All rights reserved.


Keywords:  Clustering; Mutations; SARS-CoV-2; Shannon entropy


Year:  2021        PMID: 34655896      PMCID: PMC8492016          DOI: 10.1016/j.compbiomed.2021.104915

Source DB:  PubMed          Journal:  Comput Biol Med        ISSN: 0010-4825            Impact factor:   4.589


Contributions of the work: (1) The paper proposes a computational methodology to identify potential mutational hotspots in the spike protein of SARS-CoV-2; this high-throughput methodology can also identify some of the dangerous mutations that may emerge in the future. (2) It uses clustering analysis to understand and identify the similarities and patterns among the different types of mutations. Such an analysis may help biologists better understand the relationships between SARS-CoV-2 mutations.

Introduction

The SARS-CoV-2 virus has rapidly evolved by continually mutating, affecting more than 180 million people across the globe. Ever since the genome sequence of SARS-CoV-2 became available, mutations at several sites in the genome have been identified, raising concerns regarding enhanced transmissibility of the virus [1]. The mutating nature of the virus has inspired global efforts from the research community to actively track and understand the emergence of variants of concern [2–4]. One of the first mutations to spread rapidly throughout the world, D614G, was first reported in April 2020 [5]. This mutation has now been classified under several lineages and is found to be a factor in increased transmission of the virus [6–9]. Its discovery was followed by the identification of a series of mutations in the virus belonging to the B.1.1.7 lineage, which was first found in the southeast of England [10]. The mutations A222V, S477N, N501Y, H69, N439K, Y453F, S98F, D80Y, A626S and V1122L have been noted as variants of interest in many studies [10–13] and are the focus of this work as well. These variants were selected because they were marked as the Variant Under Investigation SARS-CoV-2 VUI 202012/01 (Variant Under Investigation, year 2020, month 12, variant 01) by different studies done in the United Kingdom [14]. The mutation A222V belongs to the B.1.177 lineage and has been noted to have a dominating presence in European countries [15,16]. N439K and Y453F have been found to have a higher binding affinity to the hACE2 receptor and are noted to reduce the neutralizing potential of antibodies specific to SARS-CoV-2 [17–19]. N439K often co-occurs with the 69–70 deletion in the spike protein; the effect of this combined double mutation is being investigated by researchers (COVID-19 Genomics UK Consortium, 2021; [20]). N501Y has been identified as a causative factor in the increased infectiousness of the disease [21].
The numerous effects of such mutations on the transmissibility and lethality of SARS-CoV-2 make it imperative to study these mutations and understand their effects [22]. To tackle the COVID-19 pandemic, researchers have explored both the traditional paradigm of in-vitro experimentation and data-analysis-based methodologies such as machine learning. Data-driven modelling techniques, with their ability to analyze large amounts of data, build a functional mapping between input parameters and output. This paper explores the use of data-driven methodologies to understand the mutations in the SARS-CoV-2 spike protein. To understand and identify the mutation hotspots, we examine the sequence entropy and its correlation with experimentally identified variants of concern. Tomaszewski et al. defined mutational entropy as a measure of the molecular heterogeneity of the SARS-CoV-2 proteome, estimated from the positional variance in these sequences [7]. In our work, we measure the positional variance in the sequence of the SARS-CoV-2 spike protein by calculating the Shannon entropy. For proteins, Shannon entropy has been shown to correlate strongly with protein structural entropy [23] and can provide insights into the compositional stability of proteins. The Shannon entropy is also directly proportional to the inverse packing density of proteins [24], and the packing density is in turn related to increased mutagenesis. Moreover, regions of higher local flexibility have higher entropy and are prone to mutations [21]. Our study exploits these relationships to estimate the mutational hotspots in the SARS-CoV-2 spike protein. A higher entropy value at a position in the sequence indicates increased randomness at that site, whereas a low entropy value indicates increased stability and decreased randomness at that location.
Apart from identifying the hotspots of interest, we also analyze the similarity of these mutations by employing a k-means clustering algorithm. To generate the embeddings for the clustering algorithm, we leverage the protein sequence data using language modeling approaches. Through transfer learning, some of the highly successful models in the Natural Language Processing (NLP) domain have been applied to protein sequences to generate meaningful representations that can be used in tasks such as structure prediction [25]. We used the Prot-BERT language model to represent the spike protein sequences as semantically rich embeddings [26]. The Prot-BERT model has been trained on 80 billion amino acids, representing a wide variety of protein sequences, and the embeddings it generates can be used for different downstream tasks. In our work, we use the embeddings to determine the similarities between mutations using unsupervised machine learning techniques. This analysis will help in understanding the relationships between the mutations and assist the research community in tackling the virus.

Related work

Machine learning models have been used in many ways to study and understand different aspects of the COVID-19 pandemic. These models have previously been used to forecast COVID-19 cases [27–29], propose potential antibodies [30], understand the possible evolution of the virus [31], understand the economic and social effects of social distancing [32,33], assess the efficiency of lockdowns [34], and study the transmission and spread of the virus [35,36]. Data-driven models have also been used to analyze SARS-CoV-2 mutations. The authors of [37] use topological techniques such as persistent homology to understand SARS-CoV-2 mutations and uncover underlying patterns. In another study, the authors of [38] develop Informative Subtype Markers (ISM) to visualize and analyze the spread of different mutated SARS-CoV-2 sequences.

Methods

Data

To understand the effect of the mutations, we focus only on the spike protein of the virus sequence. We select the spike protein region because it is the major component of the SARS-CoV-2 virus responsible for eliciting the host immune response of neutralizing antibodies. It is the presence of the spike protein on the virus surface that allows the virus to interact with and penetrate host cells, which is why the spike protein has received the most attention in analyses of SARS-CoV-2 mutations. To this end, we collect spike protein data from the GISAID server to analyze the effect of spike protein mutations on transmissibility. We downloaded 311,256 spike protein sequences from the GISAID server (https://www.gisaid.org/) on January 3, 2020 [11,39]. The comprehensive dataset also contained sequences related to the SARS-CoV-1 virus, so the first stage of preprocessing eliminated the sequences that were not from 2020. This resulted in a dataset of 310,510 sequences. Most of these sequences comprise 1273 amino acids, with a maximum length of 1278 amino acids. To ensure uniformity in our calculation of the positional entropy, sequences shorter than 1278 were padded to length 1278 by appending the required number of 'X' characters to the end of the sequence. The original spike protein sequence found in Wuhan is taken from Zhao et al. [1], and the mutations in all the collected sequences are analyzed with respect to this sequence. Many spike protein sequences were repeated across different countries, so we curated the data further to keep only the unique sequences, since featurizing the same sequence twice with Prot-BERT would have been redundant.
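The padding and de-duplication steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the toy sequences are hypothetical, and the real sequences would first be read from the GISAID FASTA download.

```python
def pad_sequences(seqs, target_len=1278, pad_char="X"):
    """Right-pad every sequence with pad_char up to target_len."""
    return [s + pad_char * (target_len - len(s)) for s in seqs]


def unique_sequences(seqs):
    """Drop repeated sequences while keeping first-seen order."""
    return list(dict.fromkeys(seqs))


# Toy stand-ins for spike sequences (real ones are about 1273 residues long).
toy = ["MFVFL", "MFVF", "MFVFL", "MFVLL"]

padded = pad_sequences(toy, target_len=6)   # used for the entropy analysis
deduped = unique_sequences(toy)             # used before computing embeddings
```

Padding with 'X' is convenient here because 'X' is also the symbol that is masked out later in the entropy calculation, so the padding does not contribute to the entropy of the trailing positions.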
We found 53,898 unique spike protein sequences belonging to the prime variants of interest. This dataset was then used to generate embeddings with the ProtBERT model, and the embeddings were used to carry out the unsupervised machine learning analysis. To understand and visualize the spread of the data, we generated a t-SNE plot [40], shown in Fig. 1.
Fig. 1

t-SNE plot capturing the distribution of the data collected from the GISAID server. Some of the variants of concern like N439K, N501Y are clustered near each other. From the t-SNE, we can easily infer that the SARS-CoV-2 mutations have unique characteristics.

We also analyzed the geographical distribution of the countries represented in the dataset. The United Kingdom and Denmark contributed over 50% of the mutation sequences, with 140458 sequences from the United Kingdom and 20346 from Denmark. These two countries have proactively studied the different mutations and made the data available for public use via the GISAID server. To show the mutation sequence data from other countries, the distribution for countries with more than 200 but fewer than 5000 mutation sequences is given in Fig. 2.
Fig. 2

Plot showing the distribution of the sequences in the data. Apart from United Kingdom and Denmark, the other countries actively tracking the variants of concern include USA, Australia, South Africa, and Switzerland.


Positional entropy calculations

The positional entropy is a measure of the randomness at a given position in the sequence [41]. To calculate the positional entropy for our dataset, we use the Shannon entropy formulation stated in Equation (1) [42]:

$$H = -\sum_{k \in L} p_k \log p_k \qquad (1)$$

where $L$ is the list of all possible amino acids in all the sequences and $p_k$ is the probability of finding the $k$th amino acid at that position. We use Equation (1) to find the positional entropy for every position in the SARS-CoV-2 spike protein sequence.

Using the dataset obtained from the GISAID server, we first pre-process the data with Biopython [43] to extract the sequences from the downloaded FASTA file. We found that the length of the spike protein sequence varied from 1270 to 1278; the distribution of sequence lengths is shown in Fig. S1. We also observed that positions containing ambiguous sites or unidentified amino acids are denoted with the character "X" in the dataset. These positions are handled by a masking operation that calculates the entropy without considering them [38]. We then calculate the positional entropy values using Equation (1) and store them in an array.

To identify the regions of high entropy that may be associated with harmful mutations, we use a running mean (window length = 15, step size = 1), in which the first positional index of the window is assigned the value of the running mean. In the running mean calculation we do not consider the first 60 and last 60 amino acids of the sequences because of sequencing uncertainty. The running means are stored in a second array, which is then sorted, and the top 100 values are selected.
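The entropy and running-mean steps can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' implementation: the function names, the choice of the natural logarithm (the text does not state a base), the 0-based indexing, and the toy sequences are all assumptions.

```python
import math
from collections import Counter


def positional_entropy(seqs, mask_char="X"):
    """Shannon entropy of each column, ignoring mask_char (natural log assumed)."""
    entropies = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs if s[i] != mask_char)
        total = sum(counts.values())
        if total == 0:  # column contains only masked characters
            entropies.append(0.0)
            continue
        # p * log(1/p) is the same as -p * log(p), written to avoid a -0.0 result
        entropies.append(sum((c / total) * math.log(total / c)
                             for c in counts.values()))
    return entropies


def running_mean(values, window=15, skip=60):
    """Map each window start index to the mean of values[start:start + window].

    The first and last `skip` positions are excluded, as described in the text.
    """
    lo, hi = skip, len(values) - skip
    return {start: sum(values[start:start + window]) / window
            for start in range(lo, hi - window + 1)}


# Toy alignment of equal-length sequences; 'X' marks an unidentified residue.
toy = ["ACD", "ACE", "AXE", "AGE"]
entropy = positional_entropy(toy)
means = running_mean(entropy, window=2, skip=0)
```

In the toy alignment the first column is fully conserved and therefore has zero entropy, while the masked 'X' in the second column is simply left out of the counts for that position.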
Subsequently, we define a hotspot as a region with at least two consecutive high-entropy positions among the top 100 positional entropy values. For example, positions 210 and 211 both belong to the top 100 positional entropy values, and hence the region 210–224 is identified as a hotspot. To ensure that both positions (210 and 211) are included, we select the lowest index (210) as the start position of the hotspot, and the 15 positions covered by its running-mean window are taken as the hotspot (210–224). Additional details about the distribution of sequence lengths in the data (Fig. S1) and the starting positions of the running-mean windows for the top 100 positional entropy values (Table S1) are provided in the supplementary information.
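The hotspot rule can be illustrated with a short Python sketch. It assumes that the input is a mapping from window start index to running-mean entropy, and that each run of consecutive top-ranked starts yields one hotspot beginning at the lowest index of the run. How the original analysis treats very long runs is not stated, so that part is an assumption, as are the function name and the toy values.

```python
def find_hotspots(window_means, top_n=100, window=15, min_run=2):
    """Return (start, end) hotspots from runs of consecutive top-ranked window starts."""
    # Keep the top_n window starts by mean entropy, then order them by position.
    top = sorted(sorted(window_means, key=window_means.get, reverse=True)[:top_n])
    hotspots, run = [], []
    for pos in top + [None]:  # None is a sentinel that flushes the last run
        if run and pos is not None and pos == run[-1] + 1:
            run.append(pos)
            continue
        if len(run) >= min_run:
            # The lowest index starts the hotspot; it spans one full window.
            hotspots.append((run[0], run[0] + window - 1))
        run = [pos] if pos is not None else []
    return hotspots


# Made-up running-mean values keyed by window start index.
toy_means = {210: 0.9, 211: 0.8, 300: 0.7, 400: 0.6,
             401: 0.5, 402: 0.4, 500: 0.1, 501: 0.05}
toy_hotspots = find_hotspots(toy_means, top_n=6)
```

With these toy values, the isolated start at 300 is discarded because it has no consecutive neighbour in the top set, while 210 and 211 together produce the window 210–224, mirroring the example in the text.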

Prot-BERT model

The Prot-BERT model trained on the UniRef100 dataset was used to generate sequence embeddings [26]. It has 30 layers, 16 attention heads, and a hidden embedding size of 1024. We chose Prot-BERT because its embeddings have been used successfully for different downstream tasks, which increases our confidence in using them here. We generate the embeddings for the spike proteins of the mutated sequences using the pre-trained model available through the Hugging Face API [44]. The Hugging Face interface allows users to easily apply pre-trained models to various Natural Language Processing (NLP) tasks. The curated data containing the unique spike protein sequences were fed into the pre-trained Prot-BERT model, and an embedding of size 1024 was obtained for every sequence. These embeddings are then used to study similarities and understand distributions between the mutations via k-means clustering.

K means

Clustering is an unsupervised learning technique used to group a collection of unlabeled data sharing similarities. Each cluster comprises data sharing common traits that are distinct from those of members of other clusters, resulting in clusters with high internal homogeneity and high external heterogeneity [45]. Clustering can be broadly classified into two categories: hierarchical and non-hierarchical. The k-means technique used in this study is a non-hierarchical approach that requires defining the number of clusters, k. Each cluster is represented by a central location called the centroid, indexed by the cluster number k and the attribute number j. The algorithm allocates each data point to the nearest cluster by minimizing its distance from the centroid. It starts by randomly assigning centroids and then iteratively updates the centroid locations based on the points assigned to each cluster. This process continues until the centroid values no longer change or the maximum number of iterations is reached [46]. In this work, we used k-means clustering to group the different mutations based on similarities in their properties. The embeddings generated by the ProtBERT model were used as features for the clustering model. To perform k-means clustering we use the scikit-learn library, which builds the k-means model from the supplied parameters [47,48]. We chose 10 clusters, both because there are 10 different mutation types and because 10 clusters gave the highest silhouette score, 0.7228 [49].
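The iterative procedure described above can be sketched in plain Python. The study itself uses scikit-learn, so the code below is only a teaching sketch of the assign-and-update loop with a single random initialization; the function names, the seed, and the toy two-dimensional points are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=1000, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids drawn from the data
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups standing in for 1024-dimensional embeddings.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
```

In scikit-learn the corresponding estimator is `sklearn.cluster.KMeans`, whose `n_clusters`, `max_iter` and `n_init` arguments control the number of clusters, the iteration cap and the number of random restarts.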
We also implemented the MST-kNN clustering technique, but it performed poorly, with a very low silhouette score of −0.7638, and was therefore not used for further clustering analysis. We use the silhouette score because it measures how well an algorithm separates the clusters in the data. The score ranges from −1 to +1, and a high score indicates that the data points have been clustered appropriately, with similar points grouped together and dissimilar points placed in different clusters. For the other k-means parameters, the maximum number of iterations was set to 1000 and the number of initializations to 50, chosen after multiple trials with other values in order to stabilize the cluster formation.
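To make the silhouette score concrete, the sketch below computes it from its standard definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's score is (b − a) / max(a, b). The function name and toy data are hypothetical, and scoring singleton clusters as 0 is a common convention that is assumed here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # nearby points share a label
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # distant points share a label
```

The second labelling deliberately pairs distant points, which drives the score below zero, the same sign as the MST-kNN result reported above. scikit-learn provides an equivalent function, `sklearn.metrics.silhouette_score`.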

Results

Positional entropy

The advantage of analyzing the entropy lies in the fact that sequence entropy is correlated with molecular motility, which is an important factor in mutation [7,23,24]. Furthermore, studies have found a significant relationship between high-entropy hotspot regions of the viral sequence and enhanced virulence of the mutations associated with those regions, which have played a crucial role in the evolution of this disease. Hence, these sites are regions of interest in vaccine development and medicine formulation [38]. We calculated the positional entropy for all positions of the spike protein sequence and estimated the mutational hotspot regions in these viral sequences. Table 1 highlights the regions of interest we identified that correspond to some of the most dominant mutations noted in various countries. The regions of interest successfully captured the D614G mutation, which is one of the most dominant mutations and has been found to enhance the replication of SARS-CoV-2 in lung cells [50]. The regions of interest also captured the following mutations: A222V, N439K, Y453F, S477N, N501Y, D614G and V1122L [12].
Table 1

Hotspots found by analyzing the positional entropy. To determine a hotspot region, a running mean (window length = 15, step size = 1) is calculated and the top 100 values are selected. We found six such regions of interest in our analysis, in which ten mutations of interest emerged.

Hotspots     Mutation
211–225      A222V
439–453      N439K, L452R, Y453F
473–487      S477N, T478K, E484K
487–501      N501Y
602–616      D614G
1121–1135    V1122L
Apart from the mutations named above, the following mutations also fall within our hotspots: E484K, T478K, and L452R. It has been shown that E484K, together with some mutations from the B.1.1.7 lineage, requires increased amounts of antibody serum to prevent infection [51], making it especially dangerous. Interestingly, our methodology is capable of capturing some of the potentially harmful mutations that may emerge in the future. For example, our model, which uses sequence data from 2020, identifies a hotspot region from 439 to 453. This region contains a mutation of significance, L452R, which was first identified by the California Department of Public Health on 17 January 2021 [52] and was later found to be a dominant mutation worldwide in April and May 2021. Similarly, another mutation, E484K, belonging to the B.1.25 family, was recognized as a variant of concern in South Africa in April 2021 [53]. This mutation lies in the region 473–487, which includes another mutation of significance, S477N [16,54]. The emergence of variants of concern from hotspot regions identified by our methodology supports the predictive ability of the Shannon entropy based analysis. To further illustrate the positional entropy hotspots, we have plotted the positional entropy for the entire sequence of the SARS-CoV-2 spike protein in Fig. 3. Our analysis found nine other hotspot regions: 329–343, 386–400, 425–439, 530–544, 700–714, 763–777, 905–919, 955–968 and 1172–1186. Based on the validation analysis presented in Table 1, new mutations of concern may emerge in these hotspot regions.
Fig. 3

Variation of entropy with position in the spike protein. Hotspots with a higher likelihood of mutagenesis and high entropy are marked in red in the plot. The red regions (hotspots) have the maximum mean entropy over a window of length 15. The blue regions indicate relatively lower mean entropy over the window of length 15. According to the positional entropy analysis, dangerous spike protein mutations are more likely to emerge from the hotspots (red regions).

To understand the mutations structurally, we also identified the regions of the spike protein structure in which the dangerous mutations lie. The analysis was based on the study by Huang et al., who identify the different regions of the spike protein based on positions in the sequence [55]. Notably, seven possibly dangerous mutations lie in the receptor-binding domain of the spike protein; these mutations are possibly more lethal because of their location on the binding interface. The locations of these mutations on the spike protein are presented in Table 2.
Table 2

Location of the mutations in the spike protein of SARS-CoV-2. The mutations are located in three regions of the spike protein.

Spike Protein Region            Mutation
N-terminal domain               A222V
Receptor-binding domain         N439K, L452R, Y453F, S477N, T478K, E484K, N501Y
Heptapeptide repeat sequence    V1122L
We also validate the mutations in Table 1 using the EV mutation [56] methodology, which determines the favorability of a mutation by calculating a prediction epistatic score. The EV mutation data for SARS-CoV-2 are available on the server created by the authors of [57]; we used the data from this server to analyze the predicted epistatic mutation effect for the mutations presented in Table 1. The novel aspect of the EV mutation method is its ability to account for epistasis by considering the interactions between all pairs of amino acid residues in the neighborhood when quantifying mutational effects. A higher prediction score from EV mutation indicates a more favorable mutation. The analysis using EV mutation is presented in Table 3.
Table 3

Analysis of the SARS-CoV-2 mutations using EV mutation. The prediction epistatic score indicates whether a mutation is fit or not; a higher score indicates that the mutation is a better fit. The third column gives the rank among all possible mutations at the site. The rank ranges from 1 to 19, because there are 20 amino acids and a single amino acid can mutate into 19 others. The rank depends on the EV mutation score: the highest score gets rank 1, indicating a highly favorable mutation, and the lowest score gets rank 19, indicating a mutation that is not favorable according to the EV mutation calculations.

Mutation    Prediction epistatic score    Rank among all mutation possibilities
A222V       0.5465                        1
N439K       −3.8605                       10
L452R       −6.1483                       15
Y453F       −6.5665                       7
T478K       0.4154                        1
D614G       −4.7144                       2
V1122L      −6.9294                       9
Of the ten mutations in Table 1, Table 3 presents the EV mutation score for seven. The data for S477N, E484K and N501Y are unavailable on the server (Nathan Rollins, Kelly Brock, Joshua Rollins et al., 2020) and hence are not presented in Table 3. We observe that A222V and T478K are highly favorable mutations, as they have the highest prediction epistatic score among all mutations of the wild-type residue (A at site 222 and T at site 478). The D614G mutation is also highly favorable, and the mutations Y453F, V1122L and N439K may be considered moderately favorable. On the other hand, the mutation L452R may not be as favorable based on its prediction epistatic score. The EV mutation scores validate most of the mutations identified in the hotspots in Table 1, further indicating that the positional entropy of the sequence can be a useful metric for identifying future mutation hotspots. The positional entropy formulation developed in this work used data from the year 2020 and yet was able to identify some of the mutations that emerged later, in April and May 2021, such as E484K and L452R, which further validates our methodology.
We believe that our method may potentially be used to identify the dangerous mutations in advance and aid in the fight against the pandemic.

Clustering with K-means

The clustering analysis was performed on the embeddings generated by the Prot-BERT model. The embedding for each sequence is a 2D array of shape (sequence length, 1024), where 1024 is the hidden dimension of the model. We applied mean pooling over the sequence-length dimension to obtain a 1024-dimensional vector for each sequence, and this vector was used for the k-means clustering analysis. The cluster centers resulting from k-means correspond to different mutation types, supporting our assumption that the different mutation types are grouped separately. We find that 7 of the 10 mutations are identified as cluster centers, with a few repeats. On analyzing the spike protein sequences that form each cluster and the sequence representative of the cluster center, we find that in most cases the majority of the sequences are of the same type as the cluster center, whereas in most of the remaining cases the mutation type of the cluster center is among the top 3 mutation types present in the cluster, the other two possibly having similar characteristics (Table 4). For example, the plots in Fig. 4 show that the clusters of S477N and N439K consist mostly of S477N and N439K sequences, respectively. Furthermore, A222V has the second-highest count in the cluster representing S477N (Fig. 4), indicating similarities between them. D80Y is one of the major components of the N439K cluster, implying similarity in characteristics. The study in Ref. [58] found that A222V and S477N are both stabilizing mutations, which supports our finding that these two mutations may have some similar characteristics. This similarity analysis is significant because, when designing therapeutics that can counter new mutations, understanding the characteristics of mutations computationally can save a lot of experimental time and accelerate the therapeutic development process.
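Two steps from the paragraph above lend themselves to a short sketch: mean pooling of per-residue embeddings into a single vector, and counting the dominant mutation types within each cluster. The code is a minimal illustration with hypothetical function names and toy inputs; real embeddings would have 1024 columns, and the labels would come from the curated dataset.

```python
from collections import Counter


def mean_pool(token_embeddings):
    """Average a (sequence_length, hidden_size) nested list over the sequence axis."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def dominant_mutations(cluster_ids, mutation_labels, top=3):
    """For each cluster, list its `top` most frequent mutation labels."""
    per_cluster = {}
    for cid, mutation in zip(cluster_ids, mutation_labels):
        per_cluster.setdefault(cid, Counter())[mutation] += 1
    return {cid: [m for m, _ in counts.most_common(top)]
            for cid, counts in per_cluster.items()}


# Toy embedding: 2 residues x 3 hidden dimensions.
pooled = mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])

# Toy cluster assignments paired with the mutation type of each sequence.
toy_ids = [0, 0, 0, 0, 1, 1, 1]
toy_mutations = ["S477N", "S477N", "S477N", "A222V", "N439K", "N439K", "D80Y"]
summary = dominant_mutations(toy_ids, toy_mutations)
```

A per-cluster summary of this kind is the sort of tally that a table of dominant mutations per cluster center could be built from.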
Table 4

Clusters where the top 3 dominant mutations in the cluster concur with the cluster center mutation. The top-3 dominant mutations are most likely to be similar in characteristics to the mutation in cluster.

Cluster Centers    Dominant Mutations in the Cluster
S477N              S477N, A222V, S98F, N439K
N439K              N439K, D80Y, N501Y, H69-70
N501Y              D80Y, N439K, N501Y, H69
A222V              V1122L, A222V, N501Y, S477N
Fig. 4

(A) Clustering analysis for the N439K mutation on the spike protein of SARS-CoV-2. After analyzing the cluster with N439K as its center, we can conclude that D80Y may have characteristics similar to those of N439K. (B) Clustering analysis for the S477N mutation on the spike protein of SARS-CoV-2. The majority of sequences in this cluster belong to the mutation S477N and the next highest number is that of A222V, suggesting similarity between them.


Conclusion

In this study, we developed a methodology to determine the hotspots for mutations in the spike protein sequences of SARS-CoV-2. Such a methodology can help flag variants of interest beforehand so that therapeutics can be developed for them. We found fifteen regions of interest in the spike protein sequence that may be potential hotspots for novel mutations in SARS-CoV-2. Six of these hotspots contain ten mutations that have already been flagged as possibly more transmissible by previous research. Interestingly, the hotspots identified by our methodology contain some of the newly emerging variants from India and South Africa that were marked as dangerous in April and May 2021, even though we use only sequence data available on the GISAID server before December 2020. Identifying hotspots beforehand may have implications for the development of therapeutics and for awareness of the potential threats posed by mutations in the virus. We also use the unsupervised clustering technique k-means to find similarities between variants of interest that have previously been found to be dangerous. To encode the protein sequences we use the Prot-BERT model, and the features it generates are used for the k-means analysis. Clustering the mutation variants by similarity can reduce redundant use of time and resources, since similar treatment techniques may be applicable to mutations that fall into the same cluster. One result of our analysis was the similarity between the S477N and A222V mutations, which suggests that these mutations share common traits and occurrences and may be subjected to similar treatment strategies.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.