Baishali Mullick, Rishikesh Magar, Aastha Jhunjhunwala, Amir Barati Farimani.
Abstract
The SARS-CoV-2 virus, like many other viruses, has continually changed to give rise to new variants through mutations, most commonly substitutions and indels. In some cases these mutations give the virus a survival advantage, making the mutants dangerous. In general, laboratory investigation must be carried out to determine whether a new variant has characteristics that make it more lethal or more contagious, so complex and time-consuming analyses are required to establish the exact impact of a particular mutation. The time these analyses take makes it difficult to understand the variants of concern and thereby limits the preventive action that can be taken against their rapid spread. In this analysis, we use a statistical measure, Shannon entropy, to identify the positions in the spike protein of the SARS-CoV-2 viral sequence that are most susceptible to mutation. We then use machine-learning-based clustering techniques to group known dangerous mutations by similarity in their properties. This work uses embeddings generated by a protein language model, ProtBERT, to identify mutations of a similar nature and to pick out regions of interest based on their proneness to change. Our entropy-based analysis identified fifteen hotspot regions, and we were able to validate ten known variants of interest in six of them. As the SARS-CoV-2 situation evolves rapidly, we believe that the remaining nine mutational hotspots may contain variants that emerge in the future. We believe this approach may help the research community devise therapeutics based on probable new mutation zones in the viral sequence and on the resemblance in properties between mutations.
Keywords: Clustering; Mutations; SARS-CoV-2; Shannon entropy
Year: 2021 PMID: 34655896 PMCID: PMC8492016 DOI: 10.1016/j.compbiomed.2021.104915
Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 4.589
Fig. 1. t-SNE plot capturing the distribution of the data collected from the GISAID server. Some of the variants of concern, such as N439K and N501Y, are clustered near each other. From the t-SNE plot, we can infer that the SARS-CoV-2 mutations have unique characteristics.
Fig. 2. Plot showing the distribution of the sequences in the data. Apart from the United Kingdom and Denmark, the other countries actively tracking the variants of concern include the USA, Australia, South Africa, and Switzerland.
Hotspots found by analyzing the positional entropy. To determine a hotspot region, a running mean of the entropy (window length = 15, step size = 1) is calculated and the top 100 values are selected. The table lists the six regions of interest in which the ten mutations of interest emerged.
| Hotspots | Mutation |
|---|---|
| 211–225 | A222V |
| 439–453 | N439K, L452R, Y453F |
| 473–487 | S477N, T478K, E484K |
| 487–501 | N501Y |
| 602–616 | D614G |
| 1121–1135 | V1122L |
Fig. 3. Variation of entropy with position in the spike protein. Hotspots, which have high entropy and a higher likelihood of mutagenesis, are marked in red. The red regions (hotspots) have the highest mean entropy over a window of length 15; the blue regions have relatively lower mean entropy over the same window length. According to the positional entropy analysis, dangerous spike protein mutations are more likely to emerge from the hotspots (red regions).
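The entropy and running-mean steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the spike sequences are already aligned to equal length, measures entropy in bits, and the function names (`positional_entropy`, `running_mean`, `top_windows`) are hypothetical. How the top-scoring windows are merged into the fifteen reported hotspot regions is not specified in the source, so that step is left out.

```python
import math
from collections import Counter


def positional_entropy(sequences):
    """Shannon entropy (in bits) of each column of equal-length aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        # H = sum over residues of p * log2(1 / p)
        h = sum((c / total) * math.log2(total / c) for c in counts.values())
        entropies.append(h)
    return entropies


def running_mean(values, window=15):
    """Mean of every full window, sliding one position at a time (step size 1)."""
    if len(values) < window:
        return []
    window_sum = sum(values[:window])
    means = [window_sum / window]
    for i in range(window, len(values)):
        window_sum += values[i] - values[i - window]
        means.append(window_sum / window)
    return means


def top_windows(means, k=100, window=15):
    """1-based inclusive (start, end) positions of the k windows with the highest mean."""
    best = sorted(range(len(means)), key=lambda i: means[i], reverse=True)[:k]
    return sorted((i + 1, i + window) for i in best)


# Toy alignment (made-up data): columns 3 and 4 vary, columns 1 and 2 do not.
toy = ["ACDA", "ACEA", "ACDG", "ACEG"]
toy_entropy = positional_entropy(toy)
toy_means = running_mean(toy_entropy, window=2)
toy_hotspot = top_windows(toy_means, k=1, window=2)
```

With `window=15`, a window starting at position 211 spans positions 211 to 225, which matches the width of the hotspot ranges listed in the table.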
Location of the mutations in the spike protein of SARS-CoV-2. The mutations fall in three regions of the spike protein.
| Spike Protein Region | Mutation |
|---|---|
| N-terminal domain | A222V |
| Receptor-Binding Domain | N439K, L452R, Y453F, S477N, T478K, E484K, N501Y |
| Heptapeptide repeat sequence | V1122L |
Analysis of the SARS-CoV-2 mutations using EVmutation. The predicted epistatic score indicates whether a mutation is fit or not: a higher score indicates that the mutation is a better fit. The third column gives the rank of the mutation among all possible mutations at that site. Ranks range from 1 to 19, because there are 20 amino acids and a given amino acid can mutate into any of the 19 others. The rank depends on the EVmutation score: the highest score receives rank 1, indicating that the mutation is highly favorable, and the lowest score receives rank 19, indicating that the mutation is not favorable according to the EVmutation calculations.
| Mutation | Prediction epistatic score | Rank among all mutation possibilities |
|---|---|---|
| A222V | 0.5465 | 1 |
| N439K | −3.8605 | 10 |
| L452R | −6.1483 | 15 |
| Y453F | −6.5665 | 7 |
| T478K | 0.4154 | 1 |
| D614G | −4.7144 | 2 |
| V1122L | −6.9294 | 9 |
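The ranking in the third column can be illustrated with a short sketch. It assumes the 19 substitution scores for a site are already available as a dictionary; the function name `rank_substitutions` is hypothetical, and the scores below are made-up toy values, not EVmutation output.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rank_substitutions(scores):
    """Map each substituted residue to its rank; the highest score gets rank 1."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {residue: rank for rank, residue in enumerate(ordered, start=1)}


# Toy example (made-up scores): a site whose wild-type residue is alanine.
wild_type = "A"
others = [aa for aa in AMINO_ACIDS if aa != wild_type]
toy_scores = {aa: -float(i) for i, aa in enumerate(others)}
toy_scores["V"] = 0.5  # pretend A->V is the best-scoring substitution

toy_ranks = rank_substitutions(toy_scores)
```

Because a residue cannot mutate into itself, each site has exactly 19 candidates, so the ranks run from 1 (most favorable) to 19 (least favorable).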
Clusters in which the top three dominant mutations concur with the cluster-center mutation. The top three dominant mutations in a cluster are the ones most likely to share characteristics with the cluster-center mutation.
| Cluster Centers | Dominant Mutations in the Cluster |
|---|---|
| S477N | |
| N439K | |
| N501Y | D80Y, N439K, |
| A222V | V1122L |
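Finding the dominant mutations in each cluster amounts to counting mutation labels per cluster. The sketch below assumes cluster assignments have already been produced by some clustering method applied to the sequence embeddings; the source does not state which algorithm was used or how the cluster center was chosen. The function name `dominant_mutations` is hypothetical and the counts are toy values.

```python
from collections import Counter, defaultdict


def dominant_mutations(cluster_ids, mutation_labels, top_n=3):
    """Return the top_n most frequent mutation labels in each cluster."""
    by_cluster = defaultdict(Counter)
    for cluster_id, mutation in zip(cluster_ids, mutation_labels):
        by_cluster[cluster_id][mutation] += 1
    return {
        cluster_id: [mutation for mutation, _ in counts.most_common(top_n)]
        for cluster_id, counts in by_cluster.items()
    }


# Toy assignments (made-up counts): one cluster id and one mutation label per sequence.
toy_cluster_ids = [0, 0, 0, 0, 1, 1, 1]
toy_labels = ["N439K", "N439K", "N439K", "D80Y", "S477N", "S477N", "A222V"]

toy_dominant = dominant_mutations(toy_cluster_ids, toy_labels)
```

A mutation that appears frequently in a cluster centered on a different mutation is then a candidate for having similar properties in the embedding space.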
Fig. 4. (A) Clustering analysis for the N439K mutation on the spike protein of SARS-CoV-2. From the cluster centered on N439K, we infer that D80Y may have characteristics similar to those of N439K. (B) Clustering analysis for the S477N mutation on the spike protein of SARS-CoV-2. The majority of sequences in this cluster carry the S477N mutation, and the next most frequent mutation is A222V, suggesting similarity between them.