
Feature-extraction and analysis based on spatial distribution of amino acids for SARS-CoV-2 Protein sequences.

Ranjeet Kumar Rout1, Sk Sarif Hassan2, Sabha Sheikh3, Saiyed Umer4, Kshira Sagar Sahoo5, Amir H Gandomi6.   

Abstract

BACKGROUND AND OBJECTIVE: The world is currently facing a global emergency due to COVID-19, which requires immediate strategies to strengthen healthcare facilities and prevent further deaths. To achieve effective remedies and solutions, research on different aspects, including the genomic and proteomic level characterizations of SARS-CoV-2, is critical. In this work, the spatial representation/composition and distribution frequency of the 20 amino acids across the primary protein sequences of SARS-CoV-2 were examined according to different parameters.
METHOD: To identify the spatial distribution of amino acids over the primary protein sequences of SARS-CoV-2, the Hurst exponent and Shannon entropy were applied as parameters to capture the autocorrelation and the amount of information in the spatial representations. The frequency distribution of each amino acid over the protein sequences was also evaluated. In the case of a one-dimensional sequence, the Hurst exponent (HE) was utilized due to its linear relationship with the fractal dimension (D), i.e. D+HE=2, to characterize fractality. Moreover, binary Shannon entropy was considered to measure the uncertainty in a binary sequence and was then applied to calculate amino acid conservation in the primary protein sequences.
RESULTS AND CONCLUSION: Fourteen (14) SARS-CoV protein sequences were evaluated and compared with 105 SARS-CoV-2 proteins. The simulation results demonstrate the differences in the collected information about the amino acid spatial distribution in the SARS-CoV-2 and SARS-CoV proteins, enabling researchers to distinguish between the two types of CoV. The spatial arrangement of amino acids also reveals similarities and dissimilarities among the important structural proteins, E, M, N and S, which is pivotal for establishing an evolutionary tree with other CoV strains.
Copyright © 2021 Elsevier Ltd. All rights reserved.

Keywords:  Amino acid; Frequency distribution; Hurst exponent; SARS-CoV-2; Shannon entropy

Year:  2021        PMID: 34815067      PMCID: PMC8577876          DOI: 10.1016/j.compbiomed.2021.105024

Source DB:  PubMed          Journal:  Comput Biol Med        ISSN: 0010-4825            Impact factor:   6.698


Introduction

The novel coronavirus disease (COVID-19) has rapidly become a major global emergency that has affected, and continues to affect, lives around the globe [1–3]. Presently, this disease, declared a pandemic by the WHO, is a major health concern [4,5]. At approximately 30 kb, the genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the largest known among RNA viruses [6,7]. Coronaviruses (CoVs) are classified into three different classes, α-CoV, β-CoV, and γ-CoV, based on genetic and antigenic criteria [8,9]. SARS-CoV-2 is classified as a β-CoV [10] and has received widespread research attention across the world [11–13]. Every day, new genome sequences, as well as primary protein sequences of SARS-CoV-2, are being added to databases such as the NCBI Virus database [14,15]. As of this writing, neither antiviral drugs with proven efficacy nor vaccines for SARS-CoV-2 prevention have been reported [16,17], while researchers have yet to attain a complete understanding of the molecular biology of SARS-CoV-2 infection [18,19]. As a result, COVID-19 cases continue to increase and have reached a global pandemic level, which urgently calls for in-depth knowledge of the virus, its infection mechanism, and related aspects such as forecasting its progression [18,20]. Although various protein-protein interactions (PPIs) between the virus and the host are known, the viral infection mechanism is not fully understood [21,22]. Therefore, identifying interactions between the SARS-CoV-2 proteins and host proteins will largely help to understand this mechanism and to further develop treatments and vaccines [23]. As a first step, it is critical to gain clarity on the SARS-CoV-2 proteins and on the PPIs between the virus and host proteins [24]. It is known that the protein fold depends on the number, spatial arrangement, and topological connectivity of secondary structure elements (SSEs) [25], yet the spatial arrangement of SSEs is not well understood [26].
Because the geometric three-dimensional structure of a protein depends on the spatial arrangement of the SSEs [27,28], both the spatial distribution and the presence/absence of different amino acids over a primary protein sequence of SARS-CoV-2 are significant. It is also pertinent to mention that the spatial arrangement uncovers the rules that govern the folding of polypeptide chains, and the primary sequence of a protein reveals the molecular events in evolution [29,30]. Specifically, the alternation and spatial arrangement of amino acids over the primary sequence appear to affect the function and conformability of the protein, respectively [31–33]. In the present study, the spatial composition of the 20 amino acids across the primary protein sequences of SARS-CoV-2 was examined using the Hurst exponent and Shannon entropy. A frequency analysis of the amino acids was also conducted and compared with a similar analysis of 89 genomes of SARS-CoV-2 [34]. The usability of Shannon entropy and the Hurst exponent for the analysis of protein sequences, namely for finding correlations among such sequences, is reported in [29].

Database and specifications

As of March 24, 2020, there were 944 known primary protein sequences of SARS-CoV-2 in the NCBI Virus Database [35]. Of these, only 105 sequences are distinct, although the sequence data were collected from a wide range of geographic locations around the world. The complete list of the 105 distinct sequences, denoted N1, N2, …, N105, with their corresponding accessions is provided at the end of the article in Appendix C. These 105 distinct protein sequences were considered in this study. Like SARS-CoV and MERS-CoV, the SARS-CoV-2 genome comprises 12 open reading frames (ORFs). Genes encoding structural proteins, namely spike (S), membrane (M), envelope (E), and nucleocapsid (N), are present in the remaining one-third of the genome, spanning from the 5′ to the 3′ terminal, along with several genes encoding non-structural proteins (NSPs) and accessory proteins scattered in between, as shown in Fig. 1 [36].
Fig. 1

Schematic representation of the coronavirus structure and genomic comparison of coronaviruses. (A) Representation of a coronavirus showing the different components of the particle, which is 100–160 nm in diameter. The single-stranded RNA (ssRNA) genome, covered with the envelope and membrane proteins, gains access into the host cell and hijacks the replication machinery. (B) The ssRNA of SARS-CoV-2 is about 30 kb and has similarities with the genomes of SARS-CoV and MERS-CoV. Translation of this ssRNA results in the formation of two polyproteins, namely pp1a and pp1ab, that are further sliced to generate numerous non-structural proteins (NSPs). The remaining ORFs encode various structural and accessory proteins that help in the assembly of the viral particle and in evading the immune response. This figure is taken from [36].

The 20 amino acids are distinguished as follows:
• Essential amino acids: H, I, K, L, M, F, T, W, and V
• Conditionally essential: R, C, Q, G, P, and Y
• Non-essential: A, D, N, E, and S
The replication of a virus depends on the availability of amino acids [37]. Because amino acids are required for protein synthesis, they play a crucial role in virus-related infections [38]. The absence of essential amino acids may result in empty virus particles that are free of viral nucleic acids [39]. Arginine (R) is a conditionally essential amino acid that is vital for virus replication and the progression of virus infection. The backbone of every amino acid is a central carbon atom attached to a carboxyl group (-COOH), an amino group (-NH2), a hydrogen atom, and another group of atoms (the R group) [40]. The R group gives the amino acid its unique characteristics and determines its interaction with other amino acids.
Based on their structural and general chemical characteristics, R groups are classified as:
• Aliphatic: G, A, V, L, I
• Hydroxyl: S, C, T, M
• Cyclic: P
• Aromatic: F, Y, W
• Basic: H, K, R
• Acidic: D, Q, E, N
Herein, we denote the studied amino acids by $A_1, A_2, \ldots, A_{20}$, corresponding to A, C, F, G, H, I, L, M, N, P, Q, S, T, V, W, Y, D, E, K, and R, respectively. Each primary protein sequence was decomposed into 20 different binary sequences of 0s and 1s according to the following rule: given a primary protein sequence $N_j$ of SARS-CoV-2 and an amino acid $A_i$, where i = 1 to 20, put 1 wherever $A_i$ is present and 0 elsewhere. Consequently, for every primary protein sequence $N_j$, j = 1, 2, …, 105, there are 20 binary sequences $B_{ij}$ corresponding to the 20 amino acids $A_i$, i = 1, 2, …, 20. The lengths of these 105 primary protein sequences vary widely, from 13 to 7097. The shortest complete SARS-CoV-2 protein sequence, N99, has a length of 13, and the longest, N26, has a length of 7097. There are 6, 3, 8, 10, 3, and 48 sequences of lengths 121, 275, 419, 1273, 4405, and 7096, respectively, while the other sequences have largely unique lengths. These sequences were then grouped into six groups by length, excluding the individual sequences whose lengths fall outside these six values. The complete list of the 105 proteins with their corresponding lengths is given in Table 1, and the accession IDs with details of the 944 sequences are provided in Appendix C.
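The decomposition rule described here can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB implementation: the function name `binary_sequences` and the toy input are hypothetical, and the amino acid order simply follows the listing A, C, F, …, R given in the text.

```python
# Amino acid order as listed in the text (A, C, F, G, ..., K, R).
AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"

def binary_sequences(protein):
    """Map each of the 20 amino acids to a 0/1 list over the protein:
    1 where that amino acid occurs, 0 elsewhere."""
    return {
        aa: [1 if residue == aa else 0 for residue in protein]
        for aa in AMINO_ACIDS
    }

# Hypothetical toy input, not a real SARS-CoV-2 sequence.
toy = "MKAAC"
seqs = binary_sequences(toy)
print(seqs["A"])  # [0, 0, 1, 1, 0]
print(seqs["W"])  # [0, 0, 0, 0, 0]  (W is absent, so the sequence is all zeros)
```

Because every position holds exactly one residue, the 20 binary sequences of a protein sum position-wise to a sequence of all ones, which is a convenient sanity check.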
Table 1

Lengths of the 105 primary protein sequences.

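Seq | Length | Seq | Length | Seq | Length | Seq | Length | Seq | Length | Seq | Length
N99 | 13 | N9 | 275 | N6 | 638 | N13 | 7091 | N33 | 7096 | N53 | 7096
N80 | 38 | N10 | 275 | N100 | 932 | N44 | 7095 | N34 | 7096 | N54 | 7096
N81 | 43 | N11 | 275 | N70 | 1272 | N14 | 7096 | N35 | 7096 | N55 | 7096
N68 | 61 | N101 | 290 | N69 | 1273 | N16 | 7096 | N37 | 7096 | N56 | 7096
N96 | 75 | N105 | 298 | N71 | 1273 | N17 | 7096 | N38 | 7096 | N57 | 7096
N97 | 75 | N102 | 306 | N72 | 1273 | N18 | 7096 | N39 | 7096 | N59 | 7096
N103 | 83 | N104 | 346 | N73 | 1273 | N19 | 7096 | N40 | 7096 | N60 | 7096
N98 | 113 | N88 | 419 | N74 | 1273 | N20 | 7096 | N41 | 7096 | N61 | 7096
N82 | 121 | N89 | 419 | N75 | 1273 | N21 | 7096 | N42 | 7096 | N62 | 7096
N83 | 121 | N90 | 419 | N76 | 1273 | N22 | 7096 | N43 | 7096 | N63 | 7096
N84 | 121 | N91 | 419 | N77 | 1273 | N23 | 7096 | N45 | 7096 | N64 | 7096
N85 | 121 | N92 | 419 | N78 | 1273 | N24 | 7096 | N46 | 7096 | N65 | 7096
N86 | 121 | N93 | 419 | N79 | 1273 | N25 | 7096 | N47 | 7096 | N66 | 7096
N87 | 121 | N94 | 419 | N4 | 1945 | N27 | 7096 | N48 | 7096 | N67 | 7096
N2 | 139 | N95 | 419 | N32 | 4405 | N28 | 7096 | N49 | 7096 | N26 | 7097
N15 | 180 | N7 | 500 | N36 | 4405 | N29 | 7096 | N50 | 7096
N3 | 198 | N1 | 527 | N58 | 4405 | N30 | 7096 | N51 | 7096
N8 | 222 | N5 | 601 | N12 | 7088 | N31 | 7096 | N52 | 7096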

Proposed methods

To characterize the amino acid spatial distribution over the primary protein sequences of SARS-CoV-2, the Hurst exponent and Shannon entropy were applied as parameters, and an amino acid density/frequency analysis was performed. Unsupervised machine learning has mostly been used for the analysis of gene and genome sequences and has also been used for intra-protein analysis. Markov Clustering and Affinity Propagation procedures were compared directly with the method described in [41,42] and with K-means clustering techniques in [43]. The K-means algorithm is better suited to inter-class and intra-class analysis of protein sequences [44]. A recent application of minimum-variance cluster analysis as a hierarchical agglomerative clustering technique performed well, as discussed in [45], and identified groups of molecular systems that enhance insight into peptide dynamics. The K-means clustering algorithm is used to develop homogeneous subclasses within the data, such that the data points in each cluster are as similar as possible according to a widely used distance measure, viz. the Euclidean distance. In terms of performance and applicability, K-means is one of the most commonly used simple clustering techniques [42,46]. In this paper, the K-means clustering algorithm was used to generate up to 10 clusters for each amino acid with the 105 SARS-CoV-2 datasets. The spatial feature extraction was implemented using MATLAB-2016a on Microsoft 2010 OS. The statistical analysis of these spatial features was carried out with the help of STATISTICA 10.0 software, as presented in the upcoming sections. The following sections briefly describe these methods with reference to similar works [47–49].
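To make the clustering step concrete, the following is a minimal sketch of K-means (Lloyd's algorithm) applied to scalar values such as per-sequence HEs, where the Euclidean distance reduces to an absolute difference. It is not the authors' MATLAB code: the function name `kmeans_1d`, the deterministic initialisation, and the choice of k = 3 are assumptions made purely for illustration. The eight sample values are HEs of amino acid A read from Table 2.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars; returns (centers, labels)."""
    distinct = sorted(set(values))
    if k > len(distinct):
        raise ValueError("k exceeds the number of distinct values")
    # Assumed deterministic initialisation: evenly spaced picks
    # from the sorted distinct values.
    if k == 1:
        centers = [distinct[0]]
    else:
        centers = [distinct[round(i * (len(distinct) - 1) / (k - 1))]
                   for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest center by absolute difference.
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# HEs of amino acid A for N80, N4, N18, N19, N22, N96, N97, N2 (Table 2).
he_values = [0.509, 0.531, 0.584, 0.584, 0.586, 0.709, 0.709, 0.714]
centers, labels = kmeans_1d(he_values, k=3)
print(labels)  # [0, 0, 1, 1, 1, 2, 2, 2]
```

Cluster labels produced this way are arbitrary indices, so they need not match the cluster numbers reported in the tables.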

Hurst exponent of binary sequences

The HE lies between 0 and 1, where HE is strictly less than 0.5 for rough anti-correlated sequences and lies in the range 0.5–1 for positively correlated sequences. If HE = 0.5, then the sequence is random, like white noise [50–52]. The HE of a binary sequence $\{x_i\}$ is defined as given in Eq. (1), where n is the length of the sequence:

$$\left(\frac{n}{2}\right)^{HE} = \frac{R(n)}{S(n)} \tag{1}$$

where $S(n) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - m)^2}$ and $R(n) = \max_{1 \le i \le n} Y(i) - \min_{1 \le i \le n} Y(i)$, where $Y(i) = \sum_{j=1}^{i}(x_j - m)$ and $m = \frac{1}{n}\sum_{i=1}^{n} x_i$. The autocorrelation of the binary representations of each amino acid over the SARS-CoV-2 protein sequences was obtained by measuring the Hurst exponent.
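As an illustration of how an HE could be computed for one binary sequence, here is a minimal sketch that assumes the classical rescaled-range relation (n/2)^HE = R(n)/S(n), with R(n) the range of the cumulative mean-adjusted sums and S(n) the standard deviation. Whether this matches the authors' MATLAB routine in every detail is an assumption; the function name `hurst_exponent` is hypothetical. Sequences consisting only of zeros (amino acid absent) have S(n) = 0, so the sketch returns `None`, mirroring the "*" entries in the tables.

```python
import math

def hurst_exponent(bits):
    """Single-window rescaled-range estimate: (n/2)**HE = R(n)/S(n).
    Returns None when the HE is not computable."""
    n = len(bits)
    if n <= 2:
        return None  # log(n/2) would be zero or negative
    m = sum(bits) / n
    s = math.sqrt(sum((x - m) ** 2 for x in bits) / n)
    if s == 0:
        return None  # constant sequence, e.g. amino acid absent
    cumulative, acc = [], 0.0
    for x in bits:
        acc += x - m          # Y(i): cumulative deviation from the mean
        cumulative.append(acc)
    r = max(cumulative) - min(cumulative)
    return math.log(r / s) / math.log(n / 2)

alternating = [1, 0] * 8        # strongly anti-correlated toy sequence
blocked = [1] * 8 + [0] * 8     # strongly persistent toy sequence
print(hurst_exponent(alternating))  # 0.0
print(hurst_exponent(blocked))      # 1.0
print(hurst_exponent([0] * 16))     # None
```

Using the relation D + HE = 2 stated in the abstract, the corresponding fractal dimension is simply `2 - hurst_exponent(bits)` whenever the HE is computable.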

Shannon entropy

Two kinds of Shannon entropy were considered in the present study.
• Binary Shannon entropy: The entropy of a Bernoulli process with probability $p$ of the two outcomes (0/1) is defined in Eq. (2):

$$SE = -\sum_{i=1}^{2} p_i \log_2(p_i) \tag{2}$$

where the frequency probabilities of 1's and 0's are $p_1 = k/l$ and $p_2 = (l-k)/l$, respectively; $l$ is the length of the binary sequence; and $k$ is the number of 1's in the binary sequence of length $l$ [53]. The binary Shannon entropy is a measure of the uncertainty in a binary sequence. When the probability $p = 0$, the event is certain never to occur, so there is no uncertainty and the entropy is 0. When $p = 1$, the result is certain; thus, the entropy must again be 0. When $p = 1/2$, the uncertainty is at a maximum and, consequently, the SE is 1.
• Amino acid conservation Shannon entropy: Protein Post Translational Modification (PTM) is an important biological mechanism for expanding the genetic code [54,55]. To find the conservation of amino acids in primary protein sequences, Shannon entropy is deployed. For a given protein sequence, the SE is calculated as follows:

$$SE = -\sum_{i=1}^{20} p_{A_i} \log(p_{A_i})$$

where $p_{A_i}$ represents the occurrence frequency of amino acid $A_i$ in the sequence.
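Both entropy measures can be sketched in Python as follows. The binary entropy uses base-2 logarithms, so it reaches 1 when 0s and 1s are equally frequent. For the conservation entropy, the logarithm base is an assumption: base 20 is used here as a default so that a sequence with all 20 amino acids equally frequent scores 1, and the `base` parameter can be changed. The function names are hypothetical.

```python
import math
from collections import Counter

def binary_shannon_entropy(bits):
    """Entropy (in bits) of a 0/1 sequence with k ones out of l positions."""
    l = len(bits)
    k = sum(bits)
    probs = (k / l, (l - k) / l)
    # Terms with p = 0 contribute nothing (limit of p*log p is 0).
    return sum(-p * math.log2(p) for p in probs if p > 0)

def conservation_entropy(protein, base=20):
    """Shannon entropy of the amino acid frequencies of one protein.
    The base-20 default is an assumption made for this sketch."""
    n = len(protein)
    counts = Counter(protein)
    return sum(-(c / n) * math.log(c / n, base) for c in counts.values())

print(binary_shannon_entropy([1, 0, 1, 0]))   # 1.0
print(binary_shannon_entropy([0, 0, 0, 0]))   # 0.0
print(conservation_entropy("AAAA"))           # 0.0
```

A binary entropy of 0 therefore flags an amino acid that is absent from (or makes up all of) a sequence, while values close to 1 indicate that presence and absence are nearly equally likely along the sequence.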

Amino acid density

Over the primary protein sequences of SARS-CoV-2, we aimed to explore the amino acid frequency distributions and the corresponding statistical descriptions [11,56]. The density of an amino acid over a primary protein sequence can be found using the following formula:

$$\rho(A_i, N_j) = \frac{f(A_i, N_j)}{l(N_j)}$$

where $A_i$ is an amino acid present in the primary protein sequence $N_j$; $l(N_j)$ is the length of sequence $N_j$; and $f(A_i, N_j)$ is the frequency of amino acid $A_i$ in sequence $N_j$. This amino acid density clarifies the richness of essential amino acids in contrast to others.
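A minimal sketch of the density computation follows, under the assumption that density means the count of an amino acid divided by the sequence length (any percentage scaling would be a simple multiplication). The function name `amino_acid_density` and the toy input are hypothetical.

```python
from collections import Counter

# Amino acid order as listed in the text (A, C, F, G, ..., K, R).
AMINO_ACIDS = "ACFGHILMNPQSTVWYDEKR"

def amino_acid_density(protein):
    """Return {amino_acid: count / length} for all 20 amino acids."""
    counts = Counter(protein)
    length = len(protein)
    return {aa: counts.get(aa, 0) / length for aa in AMINO_ACIDS}

# Hypothetical toy input, not a real SARS-CoV-2 sequence.
density = amino_acid_density("MKAAC")
print(density["A"])  # 0.4
print(density["W"])  # 0.0
```

For a sequence made up only of the 20 standard residues, the densities sum to 1, so the essential, conditionally essential, and non-essential shares can be compared directly by summing over each group.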

Results and discussion

Herein, the positive/negative trend of the spatial distribution of the 20 amino acids over the SARS-CoV-2 protein sequences based on the Hurst exponent and Shannon entropy is reported. As mentioned earlier, the Hurst exponent implies the fractality (organized non-linearity) of the spatial representations. Also, the amount of uncertainty in the presence/absence of amino acids over the protein sequences was determined through Shannon entropy measurements, which provide conservation information about the amino acids. Based on the frequency distributions of all amino acids over the SARS-CoV-2 protein sequences, 14 SARS-CoV protein sequences were subsequently compared with 105 SARS-CoV-2 proteins.

Hurst exponent results

For each amino acid $A_i$, the Hurst exponent (HE) was determined for the 105 binary sequences $B_{ij}$, where i = 1, 2, …, 20 and j = 1, 2, …, 105. Based on the HEs of the binary sequences of all primary protein sequences of SARS-CoV-2, ten clusters (C) are formed for amino acids A1, A2, A3, A4, A5, A6, and A7; eight clusters for A12, A18, A19, and A20; six clusters for A16 and A17; and five clusters for A8, A9, A10, A11, A13, A14, and A15. Tables 2 and 3 present the results for amino acids A1 and A2, respectively, while the corresponding tables for all other amino acids are given in Appendix A. The HE plots for the binary sequences and the corresponding histograms for all amino acids are shown in Figs. 2 and 3. It was anticipated that the HE of the binary representations of the ordering of amino acids over all the primary protein sequences would reveal the autocorrelation among the amino acids.
Table 2

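HE of the 105 binary sequences $B_{1j}$, j = 1, 2, …, 105, corresponding to amino acid $A_1$ (A).

Seq | HE | C | Seq | HE | C | Seq | HE | C | Seq | HE | C | Seq | HE | C | Seq | HE | C
N80 | 0.509 | 3 | N18 | 0.584 | 7 | N42 | 0.584 | 7 | N59 | 0.586 | 7 | N1 | 0.603 | 2 | N73 | 0.67 | 1
N4 | 0.531 | 3 | N19 | 0.584 | 7 | N45 | 0.584 | 7 | N65 | 0.586 | 7 | N5 | 0.604 | 2 | N75 | 0.67 | 1
N103 | 0.562 | 6 | N21 | 0.584 | 7 | N46 | 0.584 | 7 | N29 | 0.586 | 7 | N6 | 0.605 | 2 | N76 | 0.67 | 1
N87 | 0.574 | 7 | N23 | 0.584 | 7 | N47 | 0.584 | 7 | N88 | 0.594 | 2 | N100 | 0.635 | 5 | N77 | 0.67 | 1
N105 | 0.578 | 7 | N24 | 0.584 | 7 | N49 | 0.584 | 7 | N89 | 0.594 | 2 | N104 | 0.635 | 5 | N78 | 0.67 | 1
N20 | 0.58 | 7 | N25 | 0.584 | 7 | N51 | 0.584 | 7 | N90 | 0.594 | 2 | N3 | 0.641 | 5 | N79 | 0.67 | 1
N7 | 0.581 | 7 | N27 | 0.584 | 7 | N52 | 0.584 | 7 | N91 | 0.594 | 2 | N102 | 0.642 | 5 | N101 | 0.676 | 1
N81 | 0.582 | 7 | N28 | 0.584 | 7 | N53 | 0.584 | 7 | N92 | 0.594 | 2 | N15 | 0.647 | 5 | N98 | 0.697 | 8
N48 | 0.582 | 7 | N30 | 0.584 | 7 | N54 | 0.584 | 7 | N93 | 0.594 | 2 | N82 | 0.649 | 5 | N96 | 0.709 | 10
N50 | 0.582 | 7 | N31 | 0.584 | 7 | N55 | 0.584 | 7 | N94 | 0.594 | 2 | N83 | 0.649 | 5 | N97 | 0.709 | 10
N61 | 0.582 | 7 | N33 | 0.584 | 7 | N56 | 0.584 | 7 | N95 | 0.594 | 2 | N84 | 0.649 | 5 | N2 | 0.714 | 9
N43 | 0.582 | 7 | N34 | 0.584 | 7 | N57 | 0.584 | 7 | N64 | 0.584 | 7 | N85 | 0.649 | 5 | N99 | 0.718 | 9
N12 | 0.583 | 7 | N35 | 0.584 | 7 | N60 | 0.584 | 7 | N66 | 0.584 | 7 | N86 | 0.649 | 5 | N9 | 0.733 | 4
N13 | 0.584 | 7 | N37 | 0.584 | 7 | N62 | 0.584 | 7 | N67 | 0.584 | 7 | N74 | 0.666 | 1 | N10 | 0.733 | 4
N44 | 0.584 | 7 | N38 | 0.584 | 7 | N63 | 0.584 | 7 | N32 | 0.595 | 2 | N70 | 0.67 | 1 | N11 | 0.733 | 4
N14 | 0.584 | 7 | N39 | 0.584 | 7 | N26 | 0.584 | 7 | N36 | 0.595 | 2 | N69 | 0.67 | 1
N16 | 0.584 | 7 | N40 | 0.584 | 7 | N8 | 0.585 | 7 | N58 | 0.597 | 2 | N71 | 0.67 | 1
N17 | 0.584 | 7 | N41 | 0.584 | 7 | N22 | 0.586 | 7 | N68 | 0.599 | 2 | N72 | 0.67 | 1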
Table 3

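HE of the 105 binary sequences $B_{2j}$, j = 1, 2, …, 105, corresponding to amino acid $A_2$ (C).

Seq | HE | C | Seq | HE | C | Seq | HE | C | Seq | HE | C | Seq | HE | C | Seq | HE | C
N68 | * | 2 | N7 | 0.567 | 6 | N79 | 0.6 | 1 | N33 | 0.6 | 1 | N57 | 0.6 | 1 | N32 | 0.6 | 1
N88 | * | 2 | N15 | 0.576 | 6 | N70 | 0.6 | 1 | N34 | 0.6 | 1 | N59 | 0.6 | 1 | N36 | 0.6 | 1
N89 | * | 2 | N8 | 0.578 | 6 | N13 | 0.6 | 1 | N35 | 0.6 | 1 | N60 | 0.6 | 1 | N58 | 0.6 | 1
N90 | * | 2 | N87 | 0.583 | 7 | N44 | 0.6 | 1 | N37 | 0.6 | 1 | N61 | 0.6 | 1 | N102 | 0.6 | 1
N91 | * | 2 | N98 | 0.59 | 7 | N3 | 0.6 | 1 | N38 | 0.6 | 1 | N62 | 0.6 | 1 | N4 | 0.6 | 8
N92 | * | 2 | N104 | 0.59 | 7 | N14 | 0.6 | 1 | N43 | 0.6 | 1 | N63 | 0.6 | 1 | N2 | 0.6 | 8
N93 | * | 2 | N81 | 0.594 | 7 | N16 | 0.6 | 1 | N45 | 0.6 | 1 | N64 | 0.6 | 1 | N1 | 0.7 | 8
N94 | * | 2 | N80 | 0.613 | 1 | N17 | 0.6 | 1 | N46 | 0.6 | 1 | N65 | 0.6 | 1 | N6 | 0.7 | 8
N95 | * | 2 | N72 | 0.615 | 1 | N18 | 0.6 | 1 | N47 | 0.6 | 1 | N66 | 0.6 | 1 | N9 | 0.7 | 5
N99 | * | 2 | N12 | 0.617 | 1 | N19 | 0.6 | 1 | N48 | 0.6 | 1 | N67 | 0.6 | 1 | N10 | 0.7 | 5
N100 | 0.5 | 3 | N69 | 0.617 | 1 | N20 | 0.6 | 1 | N49 | 0.6 | 1 | N22 | 0.6 | 1 | N11 | 0.7 | 5
N105 | 0.5 | 3 | N71 | 0.617 | 1 | N21 | 0.6 | 1 | N50 | 0.6 | 1 | N25 | 0.6 | 1 | N5 | 0.7 | 10
N103 | 0.5 | 3 | N73 | 0.617 | 1 | N23 | 0.6 | 1 | N51 | 0.6 | 1 | N31 | 0.6 | 1 | N101 | 0.7 | 9
N82 | 0.5 | 3 | N74 | 0.617 | 1 | N24 | 0.6 | 1 | N52 | 0.6 | 1 | N39 | 0.6 | 1 | N96 | 0.7 | 4
N83 | 0.5 | 3 | N75 | 0.617 | 1 | N27 | 0.6 | 1 | N53 | 0.6 | 1 | N40 | 0.6 | 1 | N97 | 0.7 | 4
N84 | 0.5 | 3 | N76 | 0.617 | 1 | N28 | 0.6 | 1 | N54 | 0.6 | 1 | N41 | 0.6 | 1
N85 | 0.5 | 3 | N77 | 0.617 | 1 | N29 | 0.6 | 1 | N55 | 0.6 | 1 | N42 | 0.6 | 1
N86 | 0.5 | 3 | N78 | 0.617 | 1 | N30 | 0.6 | 1 | N56 | 0.6 | 1 | N26 | 0.6 | 1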
Fig. 2

Plots of the HEs and the corresponding histograms of all the binary sequences; each panel pair, (a)–(b) through (s)–(t), corresponds to one amino acid.

Fig. 3

Plots of the HEs and the corresponding histograms of all the binary sequences; each panel pair, (a)–(b) through (s)–(t), corresponds to one amino acid.

The HE of the binary representation of the amino acids forming ten clusters ranges from to with a standard deviation between 0.0296 and 0.136. For amino acid A1, cluster 3 consists of two sequences, N4 and N80. For amino acid A2, clusters 3 and 6 contain 8 and 3 sequences, respectively. For both A1 and A2, the sequences in cluster 3 have an HE of approximately 0.5, which depicts the random walk/Brownian motion-like character of the ordering of these amino acids over the corresponding protein sequences. For amino acid A1, the 103 primary protein sequences other than N4 and N80 are trending (persistent), as are almost all of the 105 SARS-CoV-2 protein sequences for amino acid A2. For amino acid A1, clusters 4, 9 and 10 together consist of seven binary representations with an HE of approximately 0.7, and for amino acid A2, cluster 4 contains two binary representations with an HE of approximately 0.734, which indicates positive autocorrelation (more persistent).
The largest cluster, i.e. cluster 8, contains 65 sequences for the amino acid , cluster 5 contains 71 protein sequences for amino acid , and cluster 8 has 54 protein sequences for amino acid , which all have an HE approximately equal to and are positively autocorrelated/persistent. All binary spatial distributions of the 105 proteins for amino acid have positive autocorrelation and are consequently persistent/trending. The essential amino acid A5(H) is not present in the protein sequences N3, N80, N97, N98 and N99 of SARS-CoV-2. The spatial organization of amino acid H is random (neither trending nor negatively autocorrelated) in the protein sequences N5, N15, N88, N89, N90, N91, N92, N93, N94, and N95, which belong to cluster 2, as shown in Table 6 (Appendix A). For amino acid A2(C), cluster 2 contains ten sequences (N68, N88, N89, N90, N91, N92, N93, N94, N95, and N99) with no HE (*), which indicates that the corresponding binary sequences are completely free from amino acid C. Protein sequences N68 and N81 lack the conditionally essential amino acid A4(G), as can be seen in Table 5 (Appendix A), while N99 is the only sequence that does not have the essential amino acid A6(I). The spatial distribution of amino acid A6(I) over the protein sequence N102 is truly random since the HE is 0.509, whereas the other 104 sequences are trending with HEs greater than 0.5. The spatial arrangements of amino acid A7(L) over these proteins are neither random nor trending, as the HE is greater than 0.5 but less than 0.6.
Table 6

Correlation matrix of SEs of present amino acids over the protein sequences.

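r (SE) | Q | S | T | V | W | Y | D | E | K | R
A | 0.321 | 0.290 | −0.019 | −0.367 | −0.143 | −0.491 | 0.192 | −0.481 | 0.073 | 0.126
C | −0.566 | −0.402 | 0.020 | 0.621 | −0.152 | 0.530 | −0.238 | 0.237 | −0.211 | −0.467
F | −0.300 | 0.037 | −0.552 | 0.267 | −0.252 | 0.181 | −0.253 | −0.261 | −0.840 | −0.539
G | 0.494 | 0.007 | 0.351 | −0.454 | 0.059 | −0.230 | 0.265 | −0.212 | 0.396 | 0.523
H | −0.279 | −0.427 | −0.112 | 0.223 | 0.363 | 0.359 | 0.172 | 0.565 | −0.019 | −0.284
I | −0.225 | −0.223 | −0.108 | 0.093 | 0.341 | 0.436 | −0.191 | 0.309 | −0.245 | −0.292
L | −0.606 | −0.086 | −0.234 | 0.355 | 0.132 | 0.016 | −0.516 | 0.184 | −0.424 | −0.356
M | −0.244 | −0.455 | 0.103 | −0.001 | 0.345 | 0.022 | 0.055 | 0.074 | 0.098 | −0.117
N | −0.039 | 0.010 | 0.220 | −0.021 | −0.227 | −0.089 | −0.024 | −0.424 | −0.032 | 0.116
P | 0.411 | −0.053 | 0.472 | −0.352 | −0.051 | 0.245 | 0.097 | −0.069 | 0.451 | 0.646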
Table 5

Correlation matrix of HEs.

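r (HE) | Q | S | T | V | W | Y | D | E | K | R
A | 0.280 | −0.342 | 0.271 | 0.667 | 0.599 | 0.306 | −0.513 | −0.711 | −0.607 | −0.625
C | −0.434 | 0.067 | 0.385 | −0.239 | −0.101 | 0.657 | 0.062 | 0.223 | 0.308 | 0.246
F | 0.538 | 0.061 | −0.273 | 0.051 | 0.265 | −0.104 | 0.107 | 0.032 | 0.230 | 0.122
G | −0.376 | 0.407 | −0.126 | −0.453 | −0.439 | 0.130 | 0.598 | 0.780 | 0.660 | 0.702
H | 0.282 | −0.201 | −0.134 | −0.095 | 0.112 | 0.052 | −0.241 | −0.140 | 0.025 | 0.006
I | 0.027 | −0.374 | −0.142 | −0.278 | −0.292 | 0.218 | −0.066 | 0.155 | 0.279 | 0.339
L | 0.103 | 0.064 | 0.491 | 0.355 | 0.400 | 0.546 | 0.038 | −0.193 | −0.200 | −0.107
M | −0.096 | 0.034 | −0.053 | −0.333 | −0.204 | 0.443 | 0.300 | 0.281 | 0.389 | 0.504
N | 0.548 | 0.102 | 0.082 | 0.806 | 0.636 | 0.116 | −0.165 | −0.509 | −0.613 | −0.452
P | 0.163 | 0.385 | 0.262 | 0.376 | 0.240 | −0.091 | 0.103 | −0.097 | −0.296 | −0.088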
The HE of the binary representation of the amino acids forming eight clusters ranges from to with a standard deviation between 0.04 and 0.111. The binary representation of the spatial organization of the non-essential amino acid A12(S) over the protein sequence N7 is negatively autocorrelated, whereas the other 104 binary representations corresponding to the protein sequences are positively trending (HE > 0.5). The largest cluster, cluster 2, contains 62 sequences for amino acid , cluster 1 has 48 sequences for amino acid , cluster 3 contains 58 protein sequences for amino acid , and cluster 1 consists of 70 protein sequences and sequences N98 and N102 for amino acid , which are positively trending, spatially. It is noteworthy that the spatial representations of amino acid S over the protein sequences N56, N13, N44, and N67 (belonging to cluster 2) all have an HE equal to 0.6, implying positive autocorrelation, while the non-essential amino acid A18(E) does not appear in the protein sequences N80 and N99. The protein sequences N80, N81 and N99 are free from amino acid A19(K). The spatial organization of amino acid K over the protein sequence N103 is negatively trending due to an HE of . The conditionally essential amino acid A20(R) is not present at all in protein sequences N81 and N99, and consequently, the HE cannot be computed. The HE of the binary representation of the amino acids forming six clusters ranges from to with a standard deviation between 0.0434 and 0.884. The largest cluster, cluster 1, contains 68 and 60 protein sequences for amino acids A16(Y) and A17(D), respectively, and is spatially spread with a positive trend. The conditionally essential amino acid Y is absent from protein sequences N99 and N103. The spatial distribution of amino acid Y over N80, the only protein belonging to cluster 6, is not trending as its HE is . The spatial distribution of amino acid D over the protein sequence N2 is random since its HE is 0.501.
The HE of the binary representation of the amino acids forming five clusters ranges from to with a standard deviation between 0.0450 and 0.0903. Cluster 3 contains 80 sequences for amino acid A8(M), with an HE of approximately 0.61, indicating trending behavior. The spatial distribution of the non-essential amino acid A9(N) over the protein sequence N2 is reverse trending (negatively autocorrelated, HE = 0.488). In cluster 1, there are 54 sequences with a slow positive trend (HE = 0.55), whereas clusters 3, 4, and 5 contain positively trending spatial representations of amino acid A9(N) over the protein sequences. Cluster 1 contains 84 for 74 different protein sequences, where amino acid A10(P) is distributed spatially in a positively trending manner since the HE is approximately 0.56. There is only one binary representation of amino acid A11(Q), over protein sequence N100, that is negatively trending. In cluster 1, protein sequences N96 and N97 are absolutely free from amino acid Q. The spatial distributions of amino acid A13(T) over the 76 protein sequences belonging to cluster 1 are positively trending. The largest cluster, cluster 2, contains 61 binary representations of the spatial distribution of amino acid A14(V) over the corresponding protein sequences, which are random as the HE turned out to be approximately 0.51. The binary representation of amino acid V over the protein sequence N8 is likewise random, as its HE is 0.5. The essential amino acid A15(W) is absent from protein sequences N80, N87, N96 and N99 and consequently, the binary representations , , and contain only zeros, and the HE is not computable, as depicted in Table 16 (Appendix A).

Collective view of HEs

The protein sequences of different lengths, ranging from 13 to 419, are provided below. Table 4 lists the amino acid(s) that are not present in the sequences.
Table 4

Absence of amino acids on various SARS-CoV-2 proteins.

Amino acid absent | Types | Sequences
C | Hydroxyl, Conditionally Essential | N68, N88, N89, N90, N95, N99
G | Aliphatic, Conditionally Essential | N68, N81
H | Basic, Essential | N3, N80, N97, N98, N99
I | Aliphatic, Essential | N99
M | Hydroxyl, Essential | N99
P | Cyclic, Conditionally Essential | N81, N99, N103
Q | Acidic, Conditionally Essential | N96, N97
T | Hydroxyl, Essential | N99
W | Aromatic, Essential | N80, N87, N96, N97, N99
Y | Aromatic, Conditionally Essential | N99, N103
E | Aromatic, Non Essential | N80, N99
K | Basic, Essential | N80, N81, N99
R | Basic, Conditionally Essential | N81, N99
The protein sequence N99 of length 13 lacks several essential, conditionally essential, and non-essential amino acids, namely C, H, I, M, P, T, W, Y, E, K and R. The longest of the sequences lacking an amino acid, N88, N89, N90, N91, N92, N93, N94, and N95 (length 419), do not contain amino acid C. It is noted that amino acid M is present in all the protein sequences except N99, which has the smallest length of 13. Also, it has been observed that the essential amino acids L, F and V as well as the non-essential amino acids A, D, N and S are present in all the protein sequences of SARS-CoV-2. In addition, none of the six conditionally essential amino acids was found in all the proteins of SARS-CoV-2. Proteins that have a length greater than 419 contain all 20 amino acids. The presence of amino acids I, G and V is reported to be of primordial importance; in this study, however, it was found that N99 does not contain I and that amino acid G is not present in sequences N68 and N81. It is also noted that amino acid H is randomly spatially distributed over protein sequences N5, N15, N88, N89, N90, N91, N92, N93, N94 and N95, as observed in the previous subsections. The essential hydroxyl amino acid M is randomly arranged over proteins N80 and N102. Also, amino acid L is distributed randomly over the protein sequence N102, while only amino acid K is randomly spread over N104. In sequences N98 and N102, amino acid R is distributed with a negative trend (HE < 0.5). Also, the amino acids K, Y, S, Q, N, and F are negatively trending over the protein sequences N103, N80, N7, N100, N2, and N5, respectively. Therefore, amino acids C, G, P, T, W, and E are distributed with positive autocorrelation (positively trending) over all the proteins in which they occur. Here, we explore the correlation (of trending behaviors) of the amino acid distribution over the 105 proteins of SARS-CoV-2.
The correlation matrix of ten amino acids, A, C, F, G, H, I, L, M, N and P, versus the other ten amino acids, Q, S, T, V, W, Y, D, E, K and R, is presented in Table 5. Based on the HEs, the spatial distribution of amino acid A is positively correlated with the spatial distributions of amino acids Q, T, V, W, and Y. Likewise, the HE of the spatial distribution of amino acid C is positively correlated with that of S, T, Y, D, E, K and R. Similarly, the positive correlations of the spatial distributions of amino acids F, G, H, I, L, M, N and P with the spatial distributions of the other amino acids are established in the correlation matrix in Table 5. The correlation based on the HEs of the spatial distributions is also demonstrated in the graphs in Fig. 4. It is worth mentioning that the correlation matrix (presented in Table 5) also displays the negative correlations between the spatial distributions of the amino acids.
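Each entry of such a matrix correlates the 105 HEs of one amino acid with the 105 HEs of another. The sketch below assumes the coefficient r is the Pearson correlation and that sequences with a missing HE (an absent amino acid) are dropped pairwise; neither detail is stated in the text, and the function name `pearson_r` and the toy values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists; pairs in which
    either value is None (HE not computable) are skipped."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy HE lists for two amino acids over five sequences.
he_first = [0.58, 0.60, 0.65, None, 0.70]
he_second = [0.55, 0.57, 0.62, 0.61, 0.67]
r = pearson_r(he_first, he_second)
print(round(r, 3))
```

Positive values of r indicate that sequences in which one amino acid is more persistently arranged also tend to show a more persistent arrangement of the other, and negative values indicate the opposite.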
Fig. 4

The correlation plot of HEs of the distribution of amino acids M and Y.

An example of the correlation (correlation coefficient r: 0.443) between the spatial distribution (autocorrelation) of amino acid M and the spatial distribution of amino acid L is given in Fig. 5.
Fig. 5

The correlation plot of HEs of amino acids M and L.

The following subsection discusses the amount of uncertainty/certainty in the presence of amino acids over the protein sequences.

Shannon entropy results

For each amino acid A_i, i = 1 to 20, the Shannon entropy (SE) was determined for the 105 binary sequences. The results reveal that five clusters (C) formed for amino acids A1, A12, A13, A14, A15, A16, A17, A18, A19, and A20; six clusters for A4, A7, A8, A9, A10, and A11; seven clusters for A2 and A3; and eight clusters for A5 and A6, as presented in Appendix B. The SE plot for the binary sequences and the corresponding histogram for amino acid A1 are given in Figs. 6 and 7 (a) and (b); those for the rest of the amino acids are shown in Appendix B. It was anticipated that the SE of the binary representations of the ordering of the amino acids over all the primary protein sequences would reveal the amount of uncertainty of the amino acids.
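As a minimal sketch of the quantity being clustered here, the binary Shannon entropy of an amino acid's presence/absence sequence can be computed as below. The base-2 logarithm and the function names are assumptions made for illustration; the toy sequence is hypothetical and not taken from the dataset.

```python
import math
from typing import List


def binary_indicator(sequence: str, amino_acid: str) -> List[int]:
    """Return 1 where `amino_acid` occurs in `sequence`, else 0."""
    return [1 if residue == amino_acid else 0 for residue in sequence]


def binary_shannon_entropy(bits: List[int]) -> float:
    """Entropy (in bits) of the 0/1 proportions; 0.0 if only one symbol occurs."""
    if not bits:
        raise ValueError("empty sequence")
    p = sum(bits) / len(bits)            # fraction of positions holding the amino acid
    if p == 0.0 or p == 1.0:
        return 0.0                       # absent (or everywhere): no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Hypothetical toy sequence, not taken from the dataset.
toy = "MLLAVLKL"
bits_L = binary_indicator(toy, "L")      # L occupies 4 of the 8 positions
se_L = binary_shannon_entropy(bits_L)
se_W = binary_shannon_entropy(binary_indicator(toy, "W"))   # W is absent
quarter = binary_shannon_entropy([1] * 11 + [0] * 33)       # p = 0.25
```

Under this definition an absent amino acid gives SE = 0, an amino acid filling half of the positions gives the maximum SE = 1, and one filling a quarter of the positions gives roughly 0.811.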
Fig. 6

Plots of SEs and histograms for all the binary sequences; each panel pair, (a) and (b) through (s) and (t), corresponds to one amino acid.

Fig. 7

Plots of SEs and histograms for all the binary sequences; each panel pair, (a) and (b) through (s) and (t), corresponds to one amino acid.

The SE of the binary representation of the amino acids forming five clusters ranges from to with a standard deviation between 0.0448 and 0.0919. The SE of the spatial distribution of amino acid in protein sequence N68 was determined to be 0.121, which is the lowest amount of uncertainty compared to the SE of other amino acids. In clusters 4 and 1, almost all the protein sequences had an SE less than 0.5, indicating the definite presence and absence of a particular amino acid over the protein sequences. The amount of uncertainty is high for protein sequences N3 and N99, with lengths of 198 and 13, respectively. Amino acids and are absent from protein sequence N99, with an SE less than 0.5, as shown in Tables 35 and 36, respectively. Amino acid V is present over all 105 proteins; hence, none of its binary representations has SE = 0. For amino acid V, the SE of N74 and N77 is 0.391, which implies that the presence of this amino acid over the proteins has good certainty, while N96 and N97 have the maximum uncertainty of SE = 0.665. Cluster 1 contains five protein sequences in which amino acid is absent; hence, SE = 0.
Also, SE = 0 for the binary spatial representations of N99 and N103 for amino acid , N80 and N99 (belonging to cluster 2) for amino acid , N80, N81 and N99 for amino acid , and N81 and N99 for amino acid , due to the absence of these amino acids. It is pertinent to note that amino acids and are present over all 105 proteins with certainty. Most of the proteins in the largest cluster 2, as well as in the other clusters, contain amino acid that is spatially distributed with certainty. The SE of the binary representation of the amino acids forming six clusters ranges from to with a standard deviation between 0.0749 and 0.852. Amino acid is absent from the primary protein sequences N68 and N81; consequently, SE = 0, which implies no uncertainty. Similarly, SE = 0 for the binary spatial representations of protein sequence N99 for amino acid , sequences N81, N99 and N103 for amino acid P, and sequences N96 and N97 for amino acid Q. Amino acid is spread spatially with certainty over the proteins N2 (length of 138) and N89, N90, N91, N92, N93, N94 and N95 (lengths of 419) in cluster 3. Clusters 1 and 5 for amino acid and cluster 1 for amino acids and contain the majority of the protein sequences, where the presence of these amino acids is spread over the proteins with almost certainty. Comparatively, clusters 2 and 6 contain five protein sequences, where the absence of the amino acid is spread with almost certainty. Cluster 3 contains one protein sequence, N80, where the spatial distribution has SE = 0.562, which indicates that the absence of amino acid over the protein is without uncertainty. The SE of the binary representation of the amino acids forming seven clusters each ranges from to with a standard deviation between 0.0667 and 0.0765. It was found that SE = 0 for the spatial distribution of amino acid in the protein sequences N68, N88, N89, N90, N91, N92, N93, N94, N95 and N99, which indicates that the amount of uncertainty is zero.
In other words, the absolute absence of amino acid over these proteins and the spatial presence of amino acid C over the protein sequences of other clusters have low uncertainty (high certainty). The SE is greater than 0.5 for the binary representations of amino acid over the proteins N81 and N99, and consequently, the amount of uncertainty is lowering. In other clusters containing the other protein sequences, the spatial presence of amino acid over the protein sequences has low uncertainty (high certainty). The SE of the binary representation of the amino acids forming eight clusters ranges from to with a standard deviation between 0.0459 and 0.0749. Because amino acid is absent from proteins N3, N80, N97, N98 and N99, and amino acid is absent from N99 (smallest length of 13), SE = 0 for these amino acids, implying there is no uncertainty. In addition, SE = 0.078 for the spatial representation of the presence and absence of amino acid over the proteins N88, N89, N90, N91, N92, N94 and N95 (lengths of 419), which belong to cluster 4; hence, the spatial distribution is more certain/orderly. All the clusters except cluster 6 contain only protein sequences over which amino acid is spatially distributed with certainty, whereas cluster 6 contains two sequences, N81 (length of 43) and N68 (length of 61), where the absence of the amino acid dominates the presence with certainty.

Collective view of SE

It is pertinent to mention that SE = 0 for the binary representation of an amino acid A_i that is absent from a protein sequence N_j, which has been demonstrated in this study. It was also observed that maximum SE was obtained for the spatial distribution of amino acids over lengthy sequences, such as N99, N80, etc. Interestingly, for a given amino acid A_i, the same SE was obtained for the spatial distributions over several protein sequences N_j, irrespective of their lengths, for many values of j. This essentially suggests that the probability of the presence of amino acid A_i over these protein sequences is the same. Further, we explored the correlation in the amount of uncertainty between the spatial distributions of the 20 amino acids over the proteins of SARS-CoV-2. Table 6 presents the correlation matrix of ten amino acids (A, C, F, G, H, I, L, M, N and P) versus another ten amino acids (Q, S, T, V, W, Y, D, E, K and R). Correlation matrix of SEs of present amino acids over the protein sequences. Based on the SEs, the spatial distribution of amino acid A was found to be positively correlated with the distributions of amino acids Q, S, D, K and R, as shown in Table 6. Likewise, the spatial distribution of amino acid C is positively correlated with those of amino acids T, V, Y and E. Similarly, the positive correlations between the spatial distributions of amino acids F, G, H, I, L, M, N and P and the other amino acids are established in the correlation matrix in Table 6, which also shows negative correlations. The SE-based correlation of the spatial distributions is also shown in the graphs in Fig. 9. An example of the SE-based correlation (correlation coefficient r = 0.646) between the spatial distribution of amino acid R and that of amino acid P is given in Fig. 8.
Fig. 9

Correlation plot of SE of the distribution of the amino acids distinct pairwise.

Fig. 8

Correlation plot of SEs of amino acids R and P.


Amino acid conservation Shannon entropy

For each of the 105 protein sequences, the amino acid conservation information was determined through SE measurement, as described earlier. Based on the Shannon entropy (SE) of each sequence, the clusters (C) were formed; the respective SE values and clusters for the 105 protein sequences are provided in Table 7.
Table 7

Amino acid conservation Shannon entropy.

Seq   SE     C    Seq   SE     C    Seq   SE     C    Seq   SE     C    Seq   SE     C    Seq   SE     C
N99   0.7    4    N87   0.936  1    N78   0.962  8    N13   0.97   2    N50   0.97   2    N21   0.97   2
N81   0.815  6    N8    0.939  3    N75   0.962  8    N23   0.97   2    N51   0.97   2    N44   0.97   2
N97   0.846  6    N101  0.942  3    N74   0.962  8    N37   0.97   2    N25   0.97   2    N24   0.97   2
N96   0.862  5    N2    0.953  7    N77   0.962  8    N49   0.97   2    N26   0.97   2    N33   0.97   2
N103  0.874  5    N104  0.953  7    N73   0.962  8    N64   0.97   2    N45   0.97   2    N28   0.97   2
N80   0.879  5    N9    0.955  7    N72   0.962  8    N66   0.97   2    N46   0.97   2    N27   0.97   2
N68   0.892  5    N7    0.955  7    N71   0.963  8    N60   0.97   2    N14   0.97   2    N52   0.97   2
N15   0.921  9    N82   0.956  7    N5    0.963  8    N12   0.97   2    N31   0.97   2    N47   0.97   2
N3    0.925  9    N6    0.956  7    N76   0.963  8    N65   0.97   2    N39   0.97   2    N62   0.97   2
N91   0.928  9    N11   0.957  7    N58   0.965  8    N56   0.97   2    N57   0.97   2    N34   0.97   2
N94   0.928  9    N10   0.958  7    N36   0.965  8    N41   0.97   2    N16   0.97   2    N22   0.97   2
N90   0.928  9    N84   0.958  7    N32   0.965  8    N55   0.97   2    N29   0.97   2    N67   0.97   2
N88   0.928  9    N85   0.958  7    N105  0.965  8    N30   0.97   2    N17   0.97   2    N20   0.971  2
N98   0.928  9    N83   0.959  7    N102  0.966  8    N53   0.97   2    N18   0.97   2    N86   0.973  2
N89   0.928  9    N4    0.961  8    N100  0.97   2    N59   0.97   2    N19   0.97   2    N1    0.982  10
N92   0.929  9    N79   0.962  8    N42   0.97   2    N40   0.97   2    N35   0.97   2
N95   0.931  1    N70   0.962  8    N61   0.97   2    N43   0.97   2    N38   0.97   2
N93   0.931  1    N69   0.962  8    N63   0.97   2    N48   0.97   2    N54   0.97   2
It can be observed that the Shannon entropy of amino acid conservation along the protein sequences of SARS-CoV-2 ranges from 0.7 to 0.982. Since the SE is close to 1, meaning that uncertainty is near its maximum, the amino acids are close to uniformly distributed over the protein sequences. More than 50% of the protein sequences (54), belonging to cluster 2 of SARS-CoV-2, have SE of about 0.97, which further implies that the amino acids are almost uniformly spread over the sequences. Subsequently, the frequency analysis of the amino acids over the proteins is given in the following subsection.
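One way to obtain a conservation entropy bounded by 1, as the values in Table 7 are, is to normalize the entropy of a sequence's amino acid composition by log 20. The sketch below assumes that normalization; it is an illustrative reading, not a formula stated in this section, and the function name and toy sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard one-letter codes


def normalized_composition_entropy(sequence: str) -> float:
    """Entropy of the amino acid composition, scaled so that 1.0 means uniform."""
    counts = Counter(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(AMINO_ACIDS))   # divide by the maximum, log 20


# Hypothetical toy sequences, not taken from the dataset.
uniform_20 = normalized_composition_entropy(AMINO_ACIDS * 3)   # all 20, equal counts
only_ten = normalized_composition_entropy("ACDEFGHIKL")        # 10 of 20 present
single = normalized_composition_entropy("LLLLLL")              # one amino acid only
```

With this normalization a sequence using all 20 amino acids equally often scores 1, while a short sequence that uses only ten amino acids cannot exceed log 10 / log 20, about 0.77.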

Frequency distribution of amino acids over the SARS-CoV-2 proteins

In this section, the frequencies of the amino acids in the 105 SARS-CoV-2 protein sequences are statistically compared, as shown in Figs. 10 and 11.
Fig. 10

Comparative statistical details of the frequencies of the amino acids A, R, N, D, C, Q, E, G, H, I, L, and K over the proteins.

Fig. 11

Statistical comparison between the frequencies of the amino acids M, P, S, T, W, Y and V over the protein sequences.

A correlation matrix of the frequency distributions of the amino acids over the 105 SARS-CoV-2 protein sequences is provided in Table 8, and the respective correlation graphs are illustrated in Fig. 12.
Table 8

Correlation matrix of the frequencies of amino acids.

     L      K      M      F      P      S      T      W      Y      V
A    0.999  1.000  0.996  0.997  0.998  0.998  0.999  0.997  0.998  0.998
R    0.995  0.997  0.993  0.994  0.997  0.996  0.996  0.995  0.995  0.993
N    0.996  0.996  0.990  0.999  0.998  0.999  0.998  0.993  0.997  0.996
D    0.997  0.998  0.996  0.997  0.998  0.997  0.998  0.996  0.999  0.998
C    0.998  0.996  0.994  0.999  0.995  0.996  0.998  0.993  0.999  0.999
Q    0.989  0.992  0.982  0.993  0.998  0.997  0.994  0.987  0.989  0.988
E    0.999  0.999  0.997  0.995  0.994  0.996  0.998  0.994  0.998  0.998
G    0.997  0.998  0.992  0.997  0.999  0.999  0.999  0.995  0.996  0.995
H    0.996  0.996  0.997  0.994  0.992  0.992  0.995  0.996  0.998  0.997
I    0.998  0.996  0.991  0.999  0.997  0.998  0.998  0.996  0.998  0.998
Fig. 12

Correlation graphs for the amino acid frequencies.

It can be observed that the correlation coefficients are very close to 1, which indicates strong correlations between the frequencies of the amino acids over the proteins. For instance, the correlation coefficient between the frequency distributions of amino acids A (aliphatic) and K (basic) is 1.000, as illustrated in Fig. 13, which indicates a strong correlation.
Fig. 13

Frequency plots of amino acids A and K over 105 proteins.

Overall, it is observed that protein sequences of the same length have very similar frequency distributions of the twenty amino acids.
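The frequency correlations reported in Table 8 can be sketched as a Pearson coefficient between the per-protein counts of two amino acids. The use of raw counts (rather than length-normalized frequencies), the helper names, and the toy sequences below are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def count_vector(sequences: Sequence[str], amino_acid: str) -> List[int]:
    """Number of occurrences of `amino_acid` in each protein sequence."""
    return [sequence.count(amino_acid) for sequence in sequences]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy proteins of increasing length, not taken from the dataset.
proteins = ["AK", "AAKKL", "AAAAKKKKLV", "AAKKLLV"]
counts_A = count_vector(proteins, "A")
counts_K = count_vector(proteins, "K")
r_AK = pearson(counts_A, counts_K)
```

In this toy input the counts of A and K both grow with sequence length, which is one way coefficients close to 1 can arise when raw counts are correlated across proteins of very different sizes.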

Spatial organization of proteins of SARS-CoV

In 2003, the SARS coronavirus (SARS-CoV) caused an epidemic in China and 22 other countries [56,57]. There are 14 protein sequences available in the NCBI database (taxid: 722424). The list of proteins (S1 to S14) with their accessions is given in Table 9.
Table 9

List of SARS-CoV proteins with their Accession and length.

Accession ID  Seq  Length
ACU31036      S1   221
ACU31045      S2   63
ACU31034      S3   274
ACU31035      S4   76
ACU31038      S5   44
ACU31041      S6   70
ACU31042      S7   4189
ACU31039      S8   422
ACU31037      S9   122
ACU31033      S10  114
ACU31040      S11  98
ACU31043      S12  121
ACU31044      S13  6880
ACU31032      S14  1241
It is noted that the protein with the accession ACU31032 (S14) is a spike protein of length 1241, as mentioned in the NCBI database. The spike protein (S-protein) is a large type I transmembrane protein of length not exceeding 1400 amino acids. The spike protein has an important function in the case of SARS-CoV [58,59]. Among all the proteins of SARS-CoV, the spike protein is the main antigenic component that is responsible for inducing host immune responses, neutralizing antibodies, and/or protective immunity against virus infection [60]. We therefore examine here the spatial representations of the amino acids over the spike protein together with the other 13 proteins, as given in Table 10. The HEs, SEs, and frequency distributions are given in the following and compared with those of the SARS-CoV-2 proteins.
Table 10

HEs and SEs of 14 proteins of the SARS-CoV.

Hurst Exponent (HEs)
Seq  A      C      F      G      H      I      L      M      N      P      Q      S      T      V      W      Y      D      E      K      R
S1   0.585  0.571  0.693  0.594  0.621  0.522  0.647  0.593  0.650  0.626  0.638  0.614  0.578  0.599  0.671  0.634  0.685  0.621  0.621  0.619
S2   0.633  -      0.557  -      0.598  0.805  0.520  0.620  0.598  0.649  0.500  0.676  0.552  0.596  0.598  0.633  0.662  0.724  0.777  0.663
S3   0.712  0.705  0.540  0.627  0.567  0.506  0.735  0.648  0.602  0.690  0.550  0.588  0.689  0.531  0.595  0.687  0.698  0.627  0.566  0.606
S4   0.709  0.733  0.694  0.625  -      0.589  0.700  0.593  0.641  0.615  -      0.647  0.603  0.574  -      0.610  0.593  0.687  0.651  0.590
S5   0.608  0.586  0.701  -      -      0.659  0.676  0.508  0.693  0.608  0.608  0.608  0.608  0.508  0.608  0.608  0.574  0.717  0.608  -
S6   0.690  0.728  0.595  0.549  0.646  0.700  0.666  0.595  0.595  0.584  0.655  0.646  0.595  0.683  0.595  0.660  -      0.601  0.555  0.634
S7   0.605  0.610  0.663  0.623  0.573  0.581  0.589  0.615  0.558  0.590  0.599  0.618  0.576  0.515  0.555  0.635  0.578  0.727  0.631  0.588
S8   0.554  -      0.604  0.648  0.573  0.600  0.609  0.604  0.614  0.596  0.641  0.695  0.516  0.536  0.549  0.644  0.689  0.548  0.700  0.623
S9   0.622  0.585  0.583  0.645  0.566  0.736  0.631  0.583  0.650  0.660  0.627  0.566  0.622  0.607  -      0.569  0.629  0.624  0.610  0.649
S10  0.540  0.585  0.521  0.549  0.549  0.680  0.673  0.604  0.585  0.531  0.655  0.654  0.581  0.666  -      0.511  -      0.585  0.664  0.527
S11  0.514  -      0.612  0.632  0.622  0.637  0.644  0.566  0.506  0.589  0.558  0.665  0.627  0.641  -      0.588  0.553  0.644  0.612  0.665
S12  0.654  0.616  0.511  0.612  0.530  0.475  0.682  0.594  0.643  0.658  0.625  0.488  0.531  0.691  0.583  0.555  0.660  0.583  0.621  0.602
S13  0.601  0.620  0.622  0.589  0.608  0.610  0.614  0.608  0.586  0.582  0.562  0.611  0.584  0.506  0.554  0.615  0.609  0.711  0.607  0.585
S14  0.688  0.619  0.610  0.579  0.635  0.555  0.627  0.615  0.592  0.551  0.649  0.585  0.576  0.535  0.564  0.627  0.598  0.558  0.577  0.584
Shannon Entropy (SEs)
Seq  A      C      F      G      H      I      L      M      N      P      Q      S      T      V      W      Y      D      E      K      R
S1   0.423  0.104  0.285  0.358  0.104  0.407  0.585  0.203  0.323  0.156  0.131  0.323  0.304  0.375  0.203  0.246  0.156  0.225  0.180  0.375
S2   0.203  0.000  0.341  0.000  0.118  0.631  0.503  0.276  0.118  0.276  0.203  0.276  0.276  0.341  0.118  0.203  0.400  0.400  0.341  0.276
S3   0.350  0.172  0.275  0.291  0.208  0.390  0.498  0.152  0.226  0.275  0.243  0.350  0.390  0.428  0.152  0.321  0.275  0.190  0.259  0.110
S4   0.297  0.240  0.297  0.176  0.000  0.240  0.689  0.101  0.350  0.176  0.000  0.443  0.350  0.689  0.000  0.297  0.101  0.240  0.176  0.176
S5   0.156  0.267  0.575  0.000  0.000  0.511  0.811  0.267  0.267  0.156  0.156  0.156  0.156  0.267  0.156  0.156  0.267  0.439  0.156  0.000
S6   0.554  0.316  0.108  0.187  0.255  0.255  0.661  0.108  0.108  0.255  0.371  0.255  0.108  0.469  0.108  0.187  0.000  0.422  0.255  0.255
S7   0.385  0.208  0.260  0.338  0.139  0.276  0.479  0.173  0.276  0.226  0.209  0.364  0.372  0.407  0.081  0.259  0.282  0.305  0.322  0.215
S8   0.404  0.000  0.198  0.490  0.093  0.186  0.334  0.122  0.305  0.379  0.412  0.412  0.387  0.174  0.093  0.174  0.305  0.198  0.370  0.379
S9   0.409  0.283  0.380  0.208  0.247  0.349  0.561  0.069  0.121  0.283  0.208  0.317  0.437  0.283  0.000  0.247  0.121  0.349  0.283  0.283
S10  0.219  0.073  0.176  0.127  0.297  0.367  0.670  0.333  0.073  0.127  0.398  0.485  0.608  0.333  0.000  0.176  0.000  0.073  0.398  0.127
S11  0.408  0.000  0.144  0.144  0.144  0.291  0.507  0.197  0.197  0.408  0.332  0.371  0.443  0.507  0.000  0.082  0.332  0.291  0.246  0.291
S12  0.121  0.382  0.285  0.285  0.248  0.382  0.439  0.210  0.210  0.351  0.248  0.319  0.121  0.411  0.069  0.351  0.285  0.382  0.210  0.248
S13  0.377  0.209  0.271  0.328  0.155  0.275  0.457  0.169  0.291  0.233  0.208  0.349  0.362  0.412  0.086  0.273  0.307  0.281  0.321  0.229
S14  0.360  0.197  0.316  0.320  0.084  0.336  0.399  0.124  0.336  0.255  0.290  0.404  0.396  0.387  0.068  0.262  0.306  0.229  0.283  0.213
It is observed that the spatial representations of the presence of all the amino acids over the spike protein S14 are positively autocorrelated (positively trending) and carry the least amount of uncertainty in the presence of the amino acids. It seems that the presence of all the amino acids is necessary to make a spike protein. It is worth mentioning that no spike proteins have yet been identified among the 105 distinct proteins of SARS-CoV-2. The amino acids A, F, I, L, M, N, P, S, T, V, Y, E, and K are present in all these 14 proteins, unlike in the case of the SARS-CoV-2 proteins, as mentioned in subsection 3.21. It is also worth mentioning that all the spatial distributions corresponding to the different amino acids over the 14 proteins are positively autocorrelated with HE > 0.5, except for the spatial distributions of amino acids I and S over the protein S12, which is a hypothetical protein. Note that the HE is left blank where the spatial distribution of an amino acid is entirely a sequence of zeros, i.e. where the amino acid is absent from the protein. In Table 11, we derive the correlation coefficients of the HEs of the spatial representations of the amino acids over the 14 SARS-CoV proteins.
Table 11

Correlation matrix of the HEs (Pairwise).

r    Q       S       T       V       W       Y       D       E       K       R
A    −0.141  −0.385  0.514   0.004   −0.244  0.283   0.260   −0.592  −0.845  −0.092
C    −0.706  −0.101  0.814   −0.288  −0.316  0.535   0.307   −0.046  −0.752  −0.077
F    0.263   0.807   −0.159  −0.431  0.305   0.253   −0.346  0.437   0.417   0.018
G    −0.503  −0.159  0.409   0.083   −0.052  0.257   0.285   0.313   0.091   0.264
H    0.298   0.680   0.037   −0.525  0.181   0.335   −0.261  −0.058  −0.239  −0.171
I    −0.256  0.723   −0.039  −0.806  −0.497  0.190   −0.758  0.696   0.120   −0.694
L    −0.302  −0.457  0.575   0.371   0.342   0.243   0.865   −0.497  −0.558  0.581
M    −0.654  0.264   0.908   −0.583  −0.286  0.796   0.138   0.096   −0.758  −0.144
N    0.408   −0.513  −0.229  0.824   0.774   −0.367  0.761   −0.614  0.118   0.798
P    −0.392  −0.418  0.456   0.457   0.412   0.153   0.854   −0.164  −0.143  0.712
It is observed from Table 11 that the correlation coefficient (r) is 0.908 for the HEs of the spatial representations of amino acids M and T over all 14 SARS-CoV proteins. Note that amino acids M and T are present in all of these proteins. Other positive correlations can also be seen in Table 11. The SE is zero whenever the spatial distribution corresponds to an amino acid that is absent from a protein. The spatial distributions of amino acids over the proteins of SARS-CoV are all without much uncertainty, except for three cases where the SEs are greater than 0.5 and the absence of amino acids dominates in terms of certainty. The correlation coefficients of the SEs of the spatial distributions of the amino acids over the 14 SARS-CoV proteins are given in Table 12. The correlations among the SEs of the spatial distributions of the amino acids over the proteins are not strong, as tabulated in Table 12. The highest positive SE-based correlation, between the spatial distributions of amino acids C and Y, is 0.572.
Table 12

Correlation matrix of the SEs of the spatial distributions of amino acids.

r    Q       S       T       V       W       Y       D       E       K       R
A    0.245   0.109   0.119   0.123   0.032   −0.190  −0.273  −0.094  0.108   0.500
C    −0.311  −0.355  −0.553  0.237   −0.009  0.572   −0.318  0.464   −0.492  −0.350
F    −0.589  −0.554  −0.270  −0.287  0.297   0.164   0.281   0.399   −0.428  −0.490
G    0.203   0.425   0.152   −0.150  0.140   0.379   0.100   −0.426  0.198   0.526
H    0.566   0.151   0.173   −0.128  −0.247  0.108   −0.391  −0.124  0.430   0.117
I    −0.253  −0.536  −0.233  −0.262  0.407   −0.029  0.298   0.351   −0.133  −0.294
L    −0.363  −0.363  −0.190  0.229   0.030   −0.245  −0.594  0.214   −0.474  −0.591
M    0.123   −0.101  0.079   −0.237  0.162   −0.308  0.112   −0.089  0.168   −0.345
N    −0.468  0.145   −0.080  0.188   0.268   0.309   0.342   −0.176  −0.391  0.060
P    0.438   0.025   −0.079  −0.103  −0.210  −0.134  0.518   0.199   0.162   0.500

Discussion

Previous reports state that the genomes of SARS-CoV and SARS-CoV-2 exhibit similar protein sequences. However, we found that the spatial arrangement of amino acids over the studied protein sequences is clearly different, contributing to differences between the proteins. This study reveals the hidden spatial arrangement of the amino acids of SARS-CoV-2 and SARS-CoV. Specifically, the spatial arrangements of amino acids over the primary protein sequences of SARS-CoV-2 were examined according to the autocorrelation, via Hurst exponent measurements, and the presence/absence of the amino acids, via Shannon entropy. The frequency distribution of amino acids was also analyzed to categorize the protein sequences. Based on a comparative analysis, the spatial distributions of the 14 protein sequences of SARS-CoV differ markedly from those of SARS-CoV-2. The conclusions are based on the calculated HEs and SEs, which provide information about the spatial arrangement of the amino acids over the primary protein sequences of SARS-CoV-2 as well as SARS-CoV. The results, presented in Section 4, reveal the differences between the proteins of the two types of CoV. We firmly believe that our findings on the spatial distribution of the present/absent amino acids over the proteins enable a better understanding of the protein-protein interactions (PPIs) of SARS-CoV-2. For instance, the spatial arrangements reveal the similarities and dissimilarities among the important structural proteins E, M, N and S, which further helps to establish a more complete evolutionary tree among the other CoV strains. Despite these promising results, the present study is limited, as it did not consider the three-dimensional spatial structure of the associated proteins, such as RdRp, E, M, N and S.

Authors’ contribution

SH initiated the problem for the study, and RKR and SH produced the results from the data. SH, RKR, SS, SU, KSS, and AHG analyzed and interpreted the results. SH was a major contributor in writing the manuscript. All authors read and approved the final manuscript.
  44 in total

1.  Characterization of severe acute respiratory syndrome-associated coronavirus (SARS-CoV) spike glycoprotein-mediated viral entry.

Authors:  Graham Simmons; Jacqueline D Reeves; Andrew J Rennekamp; Sean M Amberg; Andrew J Piefer; Paul Bates
Journal:  Proc Natl Acad Sci U S A       Date:  2004-03-09       Impact factor: 11.205

2.  New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage.

Authors:  Yih-Shien Chiang; Tatiana I Gelfand; Alexander E Kister; Israel M Gelfand
Journal:  Proteins       Date:  2007-09-01

3.  A geometric algorithm to find small but highly similar 3D substructures in proteins.

Authors:  X Pennec; N Ayache
Journal:  Bioinformatics       Date:  1998       Impact factor: 6.937

4.  A multiple combined method for rebalancing medical data with class imbalances.

Authors:  Yun-Chun Wang; Ching-Hsue Cheng
Journal:  Comput Biol Med       Date:  2021-05-31       Impact factor: 4.589

5.  Database resources of the National Center for Biotechnology Information.

Authors:  Eric W Sayers; Jeff Beck; J Rodney Brister; Evan E Bolton; Kathi Canese; Donald C Comeau; Kathryn Funk; Anne Ketter; Sunghwan Kim; Avi Kimchi; Paul A Kitts; Anatoliy Kuznetsov; Stacy Lathrop; Zhiyong Lu; Kelly McGarvey; Thomas L Madden; Terence D Murphy; Nuala O'Leary; Lon Phan; Valerie A Schneider; Françoise Thibaud-Nissen; Bart W Trawick; Kim D Pruitt; James Ostell
Journal:  Nucleic Acids Res       Date:  2020-01-08       Impact factor: 16.971

6.  Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.

Authors:  Chaolin Huang; Yeming Wang; Xingwang Li; Lili Ren; Jianping Zhao; Yi Hu; Li Zhang; Guohui Fan; Jiuyang Xu; Xiaoying Gu; Zhenshun Cheng; Ting Yu; Jiaan Xia; Yuan Wei; Wenjuan Wu; Xuelei Xie; Wen Yin; Hui Li; Min Liu; Yan Xiao; Hong Gao; Li Guo; Jungang Xie; Guangfa Wang; Rongmeng Jiang; Zhancheng Gao; Qi Jin; Jianwei Wang; Bin Cao
Journal:  Lancet       Date:  2020-01-24       Impact factor: 79.321

7.  Research and Development on Therapeutic Agents and Vaccines for COVID-19 and Related Human Coronavirus Diseases.

Authors:  Cynthia Liu; Qiongqiong Zhou; Yingzhu Li; Linda V Garner; Steve P Watkins; Linda J Carter; Jeffrey Smoot; Anne C Gregg; Angela D Daniels; Susan Jervey; Dana Albaiu
Journal:  ACS Cent Sci       Date:  2020-03-12       Impact factor: 14.553

Review 8.  COVID-19, an emerging coronavirus infection: advances and prospects in designing and developing vaccines, immunotherapeutics, and therapeutics.

Authors:  Kuldeep Dhama; Khan Sharun; Ruchi Tiwari; Maryam Dadar; Yashpal Singh Malik; Karam Pal Singh; Wanpen Chaicumpa
Journal:  Hum Vaccin Immunother       Date:  2020-03-18       Impact factor: 3.452

9.  Receptor-binding domain of SARS-CoV spike protein induces highly potent neutralizing antibodies: implication for developing subunit vaccine.

Authors:  Yuxian He; Yusen Zhou; Shuwen Liu; Zhihua Kou; Wenhui Li; Michael Farzan; Shibo Jiang
Journal:  Biochem Biophys Res Commun       Date:  2004-11-12       Impact factor: 3.575

10.  Structural Genomics of SARS-CoV-2 Indicates Evolutionary Conserved Functional Regions of Viral Proteins.

Authors:  Suhas Srinivasan; Hongzhu Cui; Ziyang Gao; Ming Liu; Senbao Lu; Winnie Mkandawire; Oleksandr Narykov; Mo Sun; Dmitry Korkin
Journal:  Viruses       Date:  2020-03-25       Impact factor: 5.048
