
Identification and Characterization of Species-Specific Severe Acute Respiratory Syndrome Coronavirus 2 Physicochemical Properties.

Srinivasulu Yerukala Sathipati^1,2,3, Shinn-Ying Ho^2,4,5,6.

Abstract

There is an urgent need to elucidate the underlying mechanisms of coronavirus disease (COVID-19) so that vaccines and treatments can be devised. Severe acute respiratory syndrome coronavirus 2 has genetic similarity with bat and pangolin viruses, but a comprehensive understanding of the functions of its proteins at the amino acid sequence level is lacking. A total of 4320 sequences of human and nonhuman coronaviruses were retrieved from the Global Initiative on Sharing All Influenza Data and the National Center for Biotechnology Information. This work proposes an optimization method, COVID-Pred, with an efficient feature selection algorithm to classify the species-specific coronaviruses based on physicochemical properties (PCPs) of their sequences. COVID-Pred identified a set of 11 PCPs using a support vector machine and achieved 10-fold cross-validation and test accuracies of 99.53% and 97.80%, respectively. These findings could provide key insights into understanding the driving forces during the course of infection and assist in developing effective therapies.


Keywords: SARS-CoV-2 classification; machine learning; physicochemical properties; support vector machines

Year: 2021 PMID: 33856796 PMCID: PMC8056951 DOI: 10.1021/acs.jproteome.1c00156

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the COVID-19 pandemic that has spread around the globe since its first appearance in Wuhan, Hubei province of China, in early December 2019.[1] As of 21 February 2021, the World Health Organization had reported 110.38 million confirmed cases and 2,446,008 deaths globally, making COVID-19 a major health concern. As of 18 February 2021, at least seven different vaccines across three platforms had been rolled out in countries. Coronaviruses are enveloped single-stranded positive-sense RNA viruses that constitute the subfamily Orthocoronavirinae in the family Coronaviridae.[2] The genome sequence of SARS-CoV-2 is closely related to those of severe acute respiratory syndrome coronavirus (SARS-CoV) and bat coronaviruses. It shares 79.6% sequence identity with SARS-CoV, and it is 96% identical to a bat coronavirus.[3] Human coronavirus (HCoV) genomes encode four major structural proteins including the spike (S), envelope (E), membrane (M), and nucleocapsid (N) proteins.[4] Each protein plays a significant role in the structure of the virus and in other aspects of the replication process. The S glycoprotein of coronaviruses binds to appropriate receptors to facilitate viral entry into human host cells. 
SARS-CoV-2 uses the SARS-CoV receptor angiotensin-converting enzyme 2 (ACE2) to enter the host cell, with the S protein primed by the protease TMPRSS2.[5] The S glycoprotein of SARS-CoV-2 and its entry into the host cell through ACE2 are well characterized.[6,7] The primary role of the N protein is to pack the viral genome into a nucleocapsid,[8] and it is considered to be a multifunctional protein in coronaviruses involved in the host cellular response to viral infection and replication.[9] The N protein is a key molecule in the egress and assembly of SARS-CoV, and transient expression of N is involved in the production of viruslike particles of coronaviruses.[10] The M protein plays an important role in virus assembly and in the production of viral particles.[11] Homotypic interactions of M proteins are involved in envelope formation,[12] and they contribute to the core stability of coronaviruses.[13] Elongated and compacted M proteins are associated with the flexibility and density of S proteins.[11] The interaction between the M and E proteins is involved in envelope formation and budding of coronavirus particles.[11] The coronavirus E protein is a minor component of the virus particles, but it plays an important role in virion assembly and virus–host cell interactions.[14] The absence of E protein in gastroenteritis coronaviruses blocks virus trafficking in the secretory pathway and prevents virus maturation.[15] Together, the structural and functional studies of the SARS-CoV-2 proteins can provide invaluable information about the binding potential of viruses to host cells, that is, information necessary for vaccine design. Each protein has distinct properties that allow it to perform its functions, and its interactions and dynamics depend on its physicochemical properties (PCPs). 
HCoV shares 89.1% nucleotide and 77.2% amino acid sequence similarity with some bat coronaviruses.[16] More importantly, the amino acid sequence comparison of the receptor-binding domain of the S proteins from HCoV and SARS-CoV showed that they shared only 73.8–74.9% sequence identities.[16] In addition, small changes in the amino acid sequence of the S protein are crucial for binding to its host. For instance, the bat SARS-like CoV strain cannot bind to human ACE2[17] due to minor amino acid differences from SARS-CoV. Therefore, knowing the HCoV protein PCPs and understanding the amino acid differences at the sequence level would be crucial for determining the mechanisms behind their species specificity and the functions of the HCoV proteins. Extensive efforts are being made to eradicate the COVID-19 pandemic; the number of COVID-19 tests is rapidly increasing and it produces a huge dataset, which makes it difficult to derive the key elements that are essential for treatment. Artificial intelligence and machine learning are playing a critical role in COVID-19, especially by decreasing the workload of medical experts using computed tomography scans to detect COVID-19.[18] Machine learning techniques can broaden the screening process and identify potential antiviral agents based on their protein structures and DNA sequences to predict the drug binding sites of SARS-CoV-2.[19] Therefore, machine learning methods are ideal tools for analyzing large volumes of data and for identifying promising candidates for treating COVID-19. In this study, we retrieved the protein sequences of 4320 coronaviruses from the Global Initiative on Sharing All Influenza Data (GISAID) and the National Center for Biotechnology Information (NCBI) databases. We constructed a dataset with 2225 human–host coronaviruses (HCoV) as positive samples and 2095 nonhuman–host coronaviruses (nHCoV) as negative samples. 
We used a support vector machine (SVM)-based optimization method called COVID-Pred to distinguish HCoV and nHCoV using their amino acid sequences. COVID-Pred uses an optimal feature selection algorithm called the inheritable bi-objective combinatorial genetic algorithm (IBCGA)[20] to select informative PCPs that are differentiated between HCoV and nHCoV. COVID-Pred identified 11 PCPs that are able to distinguish HCoV and nHCoV proteins. The objective of this study was to explore the PCPs and amino acid compositions that are specific to HCoV, which may be helpful in understanding how HCoV proteins function and may provide a guide for vaccine design.

Materials and Methods

Dataset

The protein sequences of 2225 HCoV were retrieved from the GISAID database (https://www.gisaid.org) on June 3, 2020, and 2095 nHCoV protein sequences were retrieved from the NCBI database. The initial dataset thus consisted of the protein sequences of 4320 coronaviruses. Since each amino acid sequence is crucial for the binding of coronaviruses to their hosts, we reduced the sequence identity to 90%. After removal of redundancy and sequence uncertainties, the final dataset consisted of 141 HCoV structural proteins of coronaviruses as positive samples, whereas 163 nHCoV S protein sequences were negative samples. Furthermore, the dataset was divided into training and test sets in a ratio of 7:3. There were 213 sequences (HCoV and nHCoV) in the training set and 91 sequences (HCoV and nHCoV) in the test set. Additionally, we used seven S protein sequences of HCoV from the NCBI database after sequence identity reduction for an independent test. All of the data set information is summarized in Tables S1,S2 (Supplementary Data 2).
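The 7:3 division can be checked with a small sketch. The text does not say whether the split was stratified by class, so the per-class sampling below is an assumption, as are the function name `stratified_split` and the random seed. With 141 HCoV and 163 nHCoV sequences, rounding 70% of each class gives 99 + 114 = 213 training and 42 + 49 = 91 test sequences, which matches the reported set sizes.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), sampling each class separately.

    Assumption: the split is stratified; the paper only states a 7:3 ratio.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(train_fraction * len(idx))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


# 141 HCoV (label 1) and 163 nHCoV (label 0), as in the final dataset.
labels = [1] * 141 + [0] * 163
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 213 91
```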

Physicochemical Properties

This study used 531 PCPs retrieved from the AAindex database developed by Kawashima and Kanehisa[21] as candidate features to construct COVID-Pred to distinguish species-specific coronavirus proteins. The original coronavirus amino acid sequences were converted into numerical indices according to the 531 PCP values. The feature representation of the 531 PCPs is described as follows:

$$\mathrm{PCP}(n) = \sum_{i=1}^{20} f(a_i)\,\mathrm{PCP}_n(a_i), \quad n = 1, 2, \ldots, 531$$

where $\mathrm{PCP}_n(a_i)$ is the value of the $i$th amino acid $a_i$ for the $n$th physicochemical property. The encoding proceeds as follows:
(1) Collect the HCoV and nHCoV protein sequences from the dataset.
(2) Calculate the composition $f(a_i)$ of a protein for the $i$th amino acid $a_i$ of the 20 amino acids, so that a protein sequence of variable length can be encoded into a feature vector of length 531.
(3) Calculate the feature value of the $n$th physicochemical property, $\mathrm{PCP}(n)$, of a coronavirus protein, where $n = 1, 2, \ldots, 531$.
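A minimal sketch of this encoding, assuming each feature is the composition-weighted sum of one AAindex scale over the 20 standard amino acids. The scale `toy_scale` is a hypothetical placeholder rather than a real AAindex entry, and the function names are illustrative. Dropping non-standard residues such as X is also an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction f(a_i) of each of the 20 standard amino acids."""
    # Assumption: ambiguous or non-standard residues are ignored.
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def pcp_feature(sequence, scale):
    """Composition-weighted sum of one property scale: sum_i f(a_i) * PCP_n(a_i)."""
    comp = composition(sequence)
    return sum(comp[aa] * scale[aa] for aa in AMINO_ACIDS)


def encode(sequence, scales):
    """Encode a variable-length sequence as one feature per property scale."""
    return [pcp_feature(sequence, scale) for scale in scales]


# Hypothetical scale for illustration only; real values come from AAindex.
toy_scale = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
toy_features = encode("MFVFLVLLPLVSSQ", [toy_scale])
```

With the 531 AAindex scales passed as `scales`, `encode` would return the length-531 vector used as classifier input.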

Proposed COVID-Pred Method

To investigate the properties of the coronavirus proteins, we proposed the COVID-Pred method, which was customized using the SVM incorporating the optimal feature selection algorithm IBCGA.

Inheritable Bi-objective Combinatorial Genetic Algorithm

To construct the COVID-Pred method, IBCGA was used for feature selection. IBCGA is a well-known feature selection algorithm that has been used for solving biological problems such as cancer survival predictions,[22−24] protein function predictions,[25] and modeling gene regulatory networks.[26,27] IBCGA is an efficient global optimization technique with an intelligent evolutionary algorithm (IEA) to select a small set of informative features from a large pool of candidate features while optimizing the prediction performance. COVID-Pred utilized the SVM classifier for distinguishing the HCoV and nHCoV. In COVID-Pred, the SVM classifier was implemented using the LIBSVM package[28] with the radial basis function (RBF) kernel. The scoring function of the RBF kernel was computed in the feature space between the two data points, $x_i$ and $x_j$. The RBF kernel function is defined as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

In IBCGA, the commonly used genetic algorithm (GA) terms such as gene and chromosome, represented as GA-gene and GA-chromosome, were used. The chromosome of IEA consists of m = 531 binary genes for selecting informative PCPs and two 4-bit GA-genes for encoding the parameters C and γ of SVM. The high performance of COVID-Pred arises from the simultaneous optimization of feature selection and fine-tuning of SVM using IBCGA. In COVID-Pred, numerical protein sequences encoded as 531 PCPs in the training dataset were used as the input. The IBCGA can simultaneously provide a set of solutions, $X_r$, where $r = r_{end}, r_{end}+1, \ldots, r_{start}$, in a single run. The feature selection algorithm IBCGA used can be described as follows:
Step 1: (Initialization) Randomly generate an initial population of $N_{pop}$ individuals. In this work, $N_{pop}$ = 50, $r_{start}$ = 50, $r_{end}$ = 10, and $r = r_{start}$.
Step 2: (Evaluation) Evaluate the fitness value of all individuals using the fitness function, that is, the prediction ACC in terms of 10-fold cross-validation.
Step 3: (Selection) Use a conventional method of tournament selection that selects the winner from two randomly selected individuals to generate a mating pool.
Step 4: (Crossover) Select two parents from the mating pool to perform an orthogonal array crossover operation of IEA.
Step 5: (Mutation) Apply a conventional bit mutation operator to parameter genes and a swap mutation to the binary genes for keeping r selected features. The best individual was not mutated for the elite strategy.
Step 6: (Termination test) If the stopping condition for obtaining the solution $X_r$ is satisfied, output the best individual as the solution $X_r$. Otherwise, go to Step 2.
Step 7: (Inheritance) If $r > r_{end}$, randomly change one bit in the binary genes for each individual from 1 to 0; decrease the number r by one and go to Step 2. Otherwise, stop the algorithm.
Step 8: (Output) Obtain a set of m PCPs from the chromosome of the best solution among the solutions $X_r$, where $r = r_{end}, r_{end}+1, \ldots, r_{start}$.
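The RBF kernel and the two IBCGA operators that control the number of selected features (the swap mutation of Step 5 and the inheritance of Step 7) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the mapping from a 4-bit gene to a power of two in `decode_parameter` is hypothetical, because the text does not give the parameter grid, and the orthogonal array crossover and fitness evaluation are omitted.

```python
import math
import random


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def decode_parameter(bits, low_exp, step):
    """Map a 4-bit gene to 2**(low_exp + step * k).

    Hypothetical mapping: the paper does not state how C and gamma are decoded.
    """
    k = int("".join(str(b) for b in bits), 2)
    return 2.0 ** (low_exp + step * k)


def swap_mutation(mask, rng):
    """Step 5: switch one selected feature off and one unselected feature on,
    so the number of selected features r is unchanged."""
    on = [i for i, b in enumerate(mask) if b == 1]
    off = [i for i, b in enumerate(mask) if b == 0]
    child = list(mask)
    if on and off:
        child[rng.choice(on)] = 0
        child[rng.choice(off)] = 1
    return child


def inherit(mask, rng):
    """Step 7: change one randomly chosen bit from 1 to 0, so r decreases by one."""
    on = [i for i, b in enumerate(mask) if b == 1]
    child = list(mask)
    if on:
        child[rng.choice(on)] = 0
    return child


rng = random.Random(0)
n_features, r_start = 531, 50
mask = [0] * n_features
for i in rng.sample(range(n_features), r_start):
    mask[i] = 1

mutated = swap_mutation(mask, rng)  # still 50 selected features
reduced = inherit(mutated, rng)     # 49 selected features
```

Repeating the inheritance step until r reaches 10 would yield one candidate solution for each feature-set size between 10 and 50 in a single run.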

Weka Classifiers

We used eight widely used machine learning methods in the Weka data mining software[29] to distinguish HCoV and nHCoV for performance comparison with COVID-Pred. They were Naive Bayes, multilayer perceptron (MLP), sequential minimal optimization (SMO), stochastic gradient descent (SGD), logistic model tree (LMT), J48, decision tree, and random forest. The classifier subset evaluator and the best first search were used for feature selection to design classifiers for distinguishing HCoV and nHCoV.

Evaluation Metrics

We evaluated the predictive performance of COVID-Pred using the following evaluation metrics: sensitivity (SN), specificity (SP), Matthews correlation coefficient (MCC), accuracy (ACC), and area under the ROC curve (AUC).
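The threshold-based metrics follow from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), with HCoV as the positive class. The sketch below uses the standard definitions, and the confusion matrix is hypothetical rather than taken from the paper. AUC is omitted because it requires ranked decision scores, not a single confusion matrix.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts for illustration only.
example = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```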

Amino Acid and Dipeptide Compositions

Amino acid composition (AAC) was measured for the HCoV and nHCoV. For the 20 amino acids denoted as $A_1, \ldots, A_{20}$, the frequency of each amino acid ($Af_i$) was measured for the protein sequence length ($L$). AAC is represented as follows:

$$\mathrm{AAC}(i) = \frac{Af_i}{L}, \quad i = 1, 2, \ldots, 20$$

Dipeptide composition (DPC) is defined over pairs of amino acids denoted as dipeptides, $A_iA_j$ (i.e., AA, AC, ..., YY), and the frequency of occurrence of each dipeptide is defined as $df_{i,j}$. The DPC is computed as

$$\mathrm{DPC}(i,j) = \frac{df_{i,j}}{n}$$

where $n = df_{1,1} + df_{2,2} + \ldots + df_{20,20}$.
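Both compositions can be computed directly from a sequence string, as in the sketch below. It makes two assumptions that go beyond the text: dipeptides are counted as overlapping adjacent residue pairs, and the DPC normalizer n is taken as the total count of all valid dipeptides in the sequence. Values are returned as fractions rather than percentages.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Amino acid composition: count of each residue divided by sequence length L."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def dpc(sequence):
    """Dipeptide composition over the 400 ordered pairs of standard residues."""
    seq = sequence.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    valid = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    # Assumption: n is the total number of valid dipeptides observed.
    n = sum(pairs[p] for p in valid)
    if n == 0:
        raise ValueError("sequence contains no valid dipeptides")
    return {p: pairs[p] / n for p in valid}


# Short illustrative fragment, not a full protein sequence.
fragment = "MFVFLVLLPLVSSQ"
fragment_aac = aac(fragment)
fragment_dpc = dpc(fragment)
```

Subtracting the mean `aac` values of two sequence groups gives per-residue composition differences of the kind reported for HCoV and nHCoV.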

Results

Identification of SARS-CoV-2 Proteins

The objective of this study was to identify and analyze the PCPs that are specific to different coronavirus species and to explore the crucial driving forces that are involved in HCoV protein functions. For this purpose, 4320 protein sequences from HCoV and other organisms (nHCoV) in FASTA format were extracted. After preprocessing the initial dataset, the final dataset consisting of 141 HCoV and 163 nHCoV protein sequences was obtained from the GISAID and NCBI databases. The COVID-Pred method was established using the SVM incorporating the optimal feature selection algorithm IBCGA to identify the PCPs that could distinguish between HCoV and nHCoV. COVID-Pred selected 11 PCPs and achieved 10-fold cross-validation (10-CV) ACC, SN, SP, MCC, AUC, test ACC, and test AUCs of 99.53%, 1.00, 0.99, 0.99, 0.996, 97.80%, and 0.991, respectively. COVID-Pred obtained 100% (7/7) accuracy on an independent dataset consisting of seven HCoV S protein sequences. The COVID-Pred performance was evaluated using ROC curves as shown in Figure S1 (Supplementary Data 1). Next, the prediction performance of COVID-Pred was compared with some machine learning methods of the Weka classifier using the full dataset (n = 304). We used the classifier subset evaluator and the best first search for the feature selection and selected 28 features to distinguish HCoV and nHCoV. Eight standard classifiers such as Naive Bayes, MLP, SMO, SGD, LMT, J48, decision tree, and random forest were used for the performance comparison. 
The Naive Bayes classifier achieved 10-CV ACC, MCC, SN, SP, and AUC of 89.80%, 0.80, 0.98, 0.84, and 0.96, respectively; MLP achieved 92.10%, 0.84, 0.96, 0.88, and 0.96, respectively; SMO achieved 88.15%, 0.77, 0.98, 0.82, and 0.87, respectively; SGD achieved 91.77%, 0.84, 0.98, 0.87, and 0.91, respectively; LMT achieved 90.78%, 0.81, 0.93, 0.88, and 0.95, respectively; J48 achieved 92.43%, 0.84, 0.93, 0.91, and 0.93, respectively; decision tree achieved 83.55%, 0.69, 0.96, 0.76, and 0.82, respectively; and random forest achieved 96.38%, 0.92, 0.97, 0.95, and 0.98, respectively. COVID-Pred obtained 10-CV ACC, MCC, SN, SP, and AUC of 99.67%, 0.99, 1.00, 0.99, and 0.99, respectively. The prediction performance of COVID-Pred was better than those of the other machine learning methods, as shown in Table 1. The COVID-Pred method achieved mean 10-CV ACC, MCC, SN, SP, and AUC of 99.38 ± 0.11%, 0.98 ± 0.003, 0.99 ± 0.003, 0.99 ± 0.001, and 0.99 ± 0.001, respectively.

Table 1

Performance Comparisons of COVID-Pred

	10-CV (%)	MCC	SN	SP	AUC
Naive Bayes	89.80	0.80	0.98	0.84	0.96
MLP	92.10	0.84	0.96	0.88	0.96
SMO	88.15	0.77	0.98	0.82	0.87
SGD	91.77	0.84	0.98	0.87	0.91
LMT	90.78	0.81	0.93	0.88	0.95
J48	92.43	0.84	0.93	0.91	0.93
decision tree	83.55	0.69	0.96	0.76	0.82
random forest	96.38	0.92	0.97	0.95	0.98
COVID-Pred	99.67	0.99	1.00	0.99	0.99
COVID-Pred (mean)	99.38 ± 0.11	0.98 ± 0.003	0.99 ± 0.003	0.99 ± 0.001	0.99 ± 0.001

Informative PCP Characterization

We ranked the identified 11 PCPs based on their prediction performance using the main effect difference (MED). A larger MED score indicates a greater contribution toward prediction accuracy. The identified 11 PCPs and their corresponding ranks and MED scores are listed in Table 2. The identified 11 properties, including FAUJ880103, ONEK900101, PALJ810116, AURR980102, FAUJ880106, TANS770103, FASG760101, MONM990101, AURR980116, DAYM780201, and RICJ880117, were analyzed further to explore their roles in SARS-CoV-2 proteins.

Table 2

MED Analysis

rank	AAindex-ID	AAindex-desc	MED
1	FAUJ880103	normalized van der Waals volume	9.94
2	ONEK900101	delta G values for the peptides extrapolated to 0 M urea	9.33
3	PALJ810116	normalized frequency of turn in α/β class	8.05
4	AURR980102	normalized positional residue frequency at the helix termini N″’	6.83
5	FAUJ880106	STERIMOL maximum width of the side chain	6.56
6	TANS770103	normalized frequency of the extended structure	6.56
7	FASG760101	molecular weight	5.68
8	MONM990101	turn propensity scale for transmembrane helices	4.33
9	AURR980116	normalized positional residue frequency at the helix termini Cc	3.80
10	DAYM780201	relative mutability	1.91
11	RICJ880117	relative preference value at C″	0.50

Normalized van der Waals Volume

The top PCP based on the MED results was the normalized van der Waals volume (FAUJ880103),[30] with a MED score of 9.94. Fauchère et al. measured the side chain parameters of the 20 amino acids. The relevance of the parameters for hydrophobicity and steric and electric properties of the amino acid side chains was assessed[30] in which the normalized van der Waals volume of the amino acid side chains was measured. There are different mechanisms involved in protein molecule interactions, including electrostatic forces, solvation forces, and van der Waals forces. Van der Waals forces act during interactions of proteins with other molecules.[31] Recently, stronger van der Waals interactions were found between SARS-CoV-2 and ACE2 compared to those between SARS-CoV and ACE2.[32] A molecular docking study on SARS-CoV-2 reported that van der Waals interactions play a major role in the binding process.[33] Yan et al. found that subtle amino acid changes improve the van der Waals interactions between SARS-CoV-2 and ACE2 and might determine the stronger interaction.[34] More amino acids that formed hydrogen bonds and van der Waals interactions were found at the SARS-CoV-2 interaction sites when compared to those at the SARS-CoV interaction sites. Wang et al. identified that the SARS-CoV-2-CTD binding interface has more amino acid residues forming van der Waals interactions than SARS-RBD that directly interacts with ACE2.[35] Stronger electrostatic and van der Waals interactions were observed between SARS-CoV-2 and ACE2 compared to those between SARS-CoV and ACE2.[36] We thus measured the normalized van der Waals volumes for HCoV and nHCoV according to FAUJ880103.[30] We observed that the average normalized van der Waals volumes for HCoV were slightly higher than those for nHCoV. The mean normalized van der Waals volumes obtained for HCoV and nHCoV were 0.17 ± 0.10 and 0.16 ± 0.09, respectively. 
Among the 20 amino acids, larger van der Waals volume differences were observed between HCoV and nHCoV for L, K, N, R, and V. Additionally, we also observed slightly larger van der Waals volumes for the HCoV S proteins compared to the nHCoV S proteins. The amino acids R, Y, K, and P showed larger differences in van der Waals volumes between the HCoV and nHCoV proteins, as shown in Figure S2A (Supplementary Data 1).

Delta G Values for the Peptides Extrapolated to 0 M Urea

The conformational preferences of the amino acids influence the secondary and tertiary structures of proteins. Aydin et al. reported a 50% α-helical content in a designed recombinant SARS-CoV S2 domain fusion protein.[37] Subsequent conformational changes at the helices are critical to the fusion of viral and host membranes and the release of the viral genome into the host cells.[38] Karyn et al. measured the free energy difference (ΔΔG0) values of amino acids by substituting them in the guest sites of alpha helices.[39] We further calculated the ΔΔG0 values for HCoV and nHCoV according to ONEK900101.[39] The mean ΔΔG0 values did not show much difference between HCoV and nHCoV, but among the 20 amino acids, L, N, K, and E showed the largest differences in ΔΔG0 values between HCoV and nHCoV. A slight difference in the ΔΔG0 value was observed between the HCoV and nHCoV S proteins in which amino acids R, Y, V, and K showed a larger difference in ΔΔG0 value compared to the others.

Normalized Frequency of Turn in α/β Class

The property of PALJ810116 is described as “Normalized frequency of turn in α/β class.”[40] Palau et al. calculated the conformational propensities of each amino acid for secondary structural alignments. The utilization of amino acids depends on the amount and topology of different secondary structures, and there are distinct preferences for α/β protein amino acids, such as I and V being the preferred amino acids in the α/β structures.[40] A circular dichroism spectroscopy study reported that a SARS-CoV-2 fusion peptide has an α-helical content.[41] Therefore, we measured the normalized propensities of α/β in HCoV and nHCoV. We observed a slight difference in the mean normalized α/β turns between HCoV and nHCoV with a mean normalized frequency of α/β turns of 0.048 ± 0.02 and 0.050 ± 0.03, respectively. Larger differences in the amino acids for this property between HCoV and nHCoV were observed for N, K, G, S, and Y. There was no mean difference observed for this property between the HCoV and nHCoV S proteins, but amino acids Y, R, P, G, and S showed a difference in the normalized frequency of turns in α/β between the HCoV and nHCoV S proteins. This analysis indicates that amino acid propensities at α/β structures of SARS-CoV-2 might play an important role in the ACE2 binding process.

Normalized Positional Residue Frequency at the Helix Termini N″

The property of AURR980102 describes the normalized positional residue frequency at the helix termini N″. Aurora and Rose examined the role of helix capping in the secondary structures of proteins and identified seven distinct capping motifs at the C- and N-termini of helices, where each motif exhibits a pattern of hydrogen bonds with hydrophobic interactions.[42] Various experiments demonstrated that the capping stabilizes the α-helices in proteins[43−45] and mutations of interacting residues in the capping motifs affect protein stability.[46] According to a previous study,[42] the normalized frequency of Pro is higher in N-terminal motifs, and Pro also functions as a hydrophobic residue. The CoV S glycoprotein is characterized by a complex of heptad-repeated regions (HR1 and HR2). The amide groups at the N-terminus of HR2 are capped by Asn, which interacts with the amide group via ordered water molecules, which may be one of the influential factors that stabilize the S glycoprotein.[47] To examine the helix capping preferences, we measured the normalized frequencies of amino acids at helix capping in HCoV and nHCoV. We observed a slight difference in the mean normalized positional residue frequency at the helix termini N″ between HCoV and nHCoV. Although there was no large mean difference between HCoV and nHCoV for this property, we observed a larger difference in the normalized positional residue frequency at the helix termini N″ for the amino acids N, L, K, E, and G between HCoV and nHCoV as well as for R, P, K, S, and E between the S proteins of HCoV and nHCoV.

Relative Mutability

Dayhoff et al. calculated the relative mutability of amino acids, which indicates the probability of amino acid changes in a given small evolutionary interval.[48] The genome analysis of SARS-CoV-2 revealed that nearly 80% of the recurrent mutations produced nonsynonymous changes at the protein level, and these mutations are possible candidates for continuing adaptation of SARS-CoV-2 to its novel human host.[49] The genetic analysis of SARS-CoV-2 discovered various mutations and deletions in coding and noncoding regions.[50] The rapid mutations of SARS-CoV-2 play important roles in the virus spread. Hence, we measured the relative mutability of HCoV and nHCoV according to DAYM780201.[48] There was a slight difference in the mean relative mutability between HCoV and nHCoV, 3.77 ± 2.24 and 3.9 ± 2.85, respectively. A larger difference in the relative mutability of amino acids between HCoV and nHCoV was observed for N, E, S, L, and T, and differences in the relative mutability of amino acids between the S proteins of HCoV and nHCoV were observed for R, S, V, N, and P. Furthermore, the mutations in the S proteins were compared with the reference strain SARS-like bat virus, which falls under nHCoV (SLCoVZXC21/2015). We observed 203 mutations in the S protein (PDBID:6ACJ) compared with the reference sequence SARS-like/Bat/Nanjing/SL-CoVZXC21/2015 (PDBID:6ACC). A detailed list of the mutations between these two structures is given in Table S3 (Supplementary Data 3). Furthermore, we performed a statistical analysis using a t-test to identify the significant amino acids of the six PCPs across HCoV and nHCoV. A p-value < 0.05 was considered statistically significant. A significant difference (p < 0.005) in van der Waals volume between HCoV and nHCoV was observed for the amino acids K, L, N, R, and V. The amino acids L, N, K, and E showed a significant difference in ΔΔG0 values between HCoV and nHCoV. 
A significant difference in the amino acids for the normalized frequency of turn in α/β class between HCoV and nHCoV was observed for N, K, G, S, and Y. A significant difference in the normalized positional residue frequency at the helix termini N″ between HCoV and nHCoV was observed for the amino acids N, L, K, E, and G. A significant difference in the relative mutability of amino acids between HCoV and nHCoV was observed for the amino acids N, E, S, L, and T. Additionally, the other six properties identified were the STERIMOL maximum width of the side chain (FAUJ880106), normalized frequency of the extended structure (TANS770103), molecular weight (FASG760101), turn propensity scale for transmembrane helices (MONM990101), normalized positional residue frequency at the helix termini Cc (AURR980116), and the relative preference value at C″ (RICJ880117). Their differences between HCoV and nHCoV are shown in Figure 1. The amino acid compositional preferences for the 11 PCPs between the S proteins of HCoV and nHCoV are shown in Figure S2 (Supplementary Data 1). Graphical representations of the five analyzed properties, FAUJ880103, ONEK900101, PALJ810116, AURR980102, and DAYM780201, are shown in Figure 2. The comparison of the mutations between the S protein and the reference strain SARS-like bat virus is shown in Figure 3. Furthermore, we ranked the amino acids based on their compositional preference differences between HCoV and nHCoV for the 11 PCPs. The amino acid rank is proportional to the compositional preference difference, meaning that the rank one amino acid has the greatest difference between HCoV and nHCoV. The amino acids that show compositional preference differences for the 11 PCPs between HCoV and nHCoV are shown in Figure 4. The amino acids that show compositional preference differences for the 11 PCPs between the S proteins of HCoV and nHCoV are shown in Figure S3 (Supplementary Data 1). 
The identified 11 PCPs and their amino acid compositional preferences are reported in Table S4 (Supplementary Data ).
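The per-amino-acid comparison by t-test described above can be illustrated with a two-sample t statistic. The text does not state which variant of the t-test was used, so the Welch (unequal variance) form below is an assumption, and the two samples are hypothetical values rather than data from the study. Converting the statistic into a p-value needs the t distribution, which the Python standard library does not provide; `scipy.stats.ttest_ind` with `equal_var=False` is one option.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Hypothetical per-protein property values for one amino acid in two groups.
group_hcov = [0.110, 0.105, 0.112, 0.108, 0.111]
group_nhcov = [0.090, 0.094, 0.088, 0.092, 0.091]
t_stat, dof = welch_t(group_hcov, group_nhcov)
```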

Figure 1

Comparison of PCPs between the HCoV and nHCoV proteins. (A) FAUJ880103, (B) ONEK900101, (C) PALJ810116, (D) AURR980102, (E) FAUJ880106, (F) TANS770103, (G) FASG760101, (H) MONM990101, (I) AURR980116, (J) DAYM780201, and (K) RICJ880117.

Figure 2

Graphical representation of the analyzed informative PCPs using the secondary structure of 6M0J as a model.

Figure 3

Visualization of the S glycoprotein with mutations. (A) Structure of the SARS-CoV S protein (PDB: 6acc, EM 3.6 Angstrom). (B) S glycoprotein (PDB: 6acj, EM 4.2 Angstrom) in complex with the host cell receptor ACE2 (green ribbon); mutations identified in the query sequences are shown as colored balls (based on the nearest residue if in the loop/termini region).

Figure 4

Normalized amino acid compositional preferences showing differences in the 11 PCPs between HCoV and nHCoV.

Analysis of Amino Acid and Dipeptide Compositions

Amino acid differences in different proteins could shed light on how SARS-CoV-2 is functionally and structurally different from humans and other organisms. Hence, the AAC differences were measured between HCoV and nHCoV. The maximum amino acid compositional differences between HCoV and nHCoV were obtained for L, K, and N with a ±2% composition difference and for E, H, R, M, Y, Q, G, V, S, and T with a ±1% composition difference, while the other amino acids did not show any differences, as shown in Figure 5A. The AAC differences for all of the amino acids are listed in Table 3. When we compared the S proteins of HCoV and nHCoV, the maximum AAC difference was observed for the amino acid R with a 2% composition difference and for P, K, G, V, N, and V with a ±1% difference, as shown in Figure S4 (Supplementary Data 1).

Figure 5

Amino acid and dipeptide compositional analysis. (A) Amino acid compositional differences between HCoV and nHCoV and (B) heatmap showing dipeptide compositional differences between HCoV and nHCoV.

Table 3

AAC Difference between the HCoV and nHCoV Proteins

amino acids	HCoV	nHCoV	composition difference
L	11%	9%	2%
K	5%	4%	2%
E	5%	3%	1%
H	2%	1%	1%
R	4%	3%	1%
M	2%	2%	1%
W	1%	1%	0%
P	4%	4%	0%
D	5%	5%	0%
C	3%	3%	0%
F	5%	5%	0%
I	6%	6%	0%
A	6%	6%	0%
Y	4%	5%	–1%
Q	4%	5%	–1%
G	6%	7%	–1%
V	7%	8%	–1%
S	7%	8%	–1%
T	7%	8%	–1%
N	5%	7%	–2%

Dipeptides play an important role in folding and peptide binding. Therefore, DPCs were measured for the HCoV, nHCoV, and HCoV S proteins. The top five DPCs obtained for HCoV were LL, FL, LV, VL, and TL, while for nHCoV, they were LL, VN, NG, SV, and SL; for the HCoV S proteins, they were RR, VL, IA, SN, and SV. A heatmap showing the differences in the DPC of HCoV and nHCoV is shown in Figure 5B. These AAC and DPC differences may be important factors for functional and pathogenic divergence of SARS-CoV-2. Heatmaps showing the DPCs of HCoV and nHCoV are shown in Figure S5A,B (Supplementary Data 1).

Conclusions

Substantial efforts are currently being made to develop therapeutic strategies[51−53] to end the COVID-19 health crisis. Identifying the informative PCPs of coronavirus proteins could assist in vaccine design and COVID-19 prevention. Machine learning has helped to solve many biological problems, which makes it a suitable tool for COVID-19 research, and machine learning-based prediction models can help to identify and analyze biomarkers that are important for vaccine design. Here, to explore the PCPs of HCoV, we developed COVID-Pred, which identifies informative PCPs and predicts species-specific coronavirus proteins. A data set of 4320 HCoV and nHCoV protein sequences was retrieved from the GISAID and NCBI databases. COVID-Pred selected 11 PCPs and achieved a 10-fold cross-validation (10-CV) ACC of 99.53% and AUC of 0.996, a test ACC of 97.80% and test AUC of 0.991, and 100% (7/7) accuracy on an independent data set consisting of seven HCoV S protein sequences. Further analysis of five informative PCPs revealed that van der Waals forces, α-helices, frequencies of amino acids at α/β turns, helix capping, and mutability contribute to the differences between HCoV and nHCoV proteins. First, the characterization analysis of these informative PCPs revealed a slight difference in the van der Waals volume between HCoV and nHCoV. Second, a difference in the ΔΔG0 value was observed between the S proteins of HCoV and nHCoV, in which the amino acids R, Y, V, and K showed a larger difference in ΔΔG0 value than the other amino acids. Third, for PALJ810116, a larger difference between HCoV and nHCoV was observed for the amino acids N, K, G, S, and Y. Fourth, we observed a larger difference in the normalized positional residue frequency at the helix termini N″ for the amino acids N, L, K, E, and G between HCoV and nHCoV as well as for R, P, K, S, and E between the S proteins of HCoV and nHCoV. Fifth, a larger difference in the relative mutability of amino acids between HCoV and nHCoV was observed for N, E, S, L, and T, whereas between the S proteins of HCoV and nHCoV, it was observed for R, S, V, N, and P. The mutational analysis showed the mutations in the S proteins relative to the reference strain, the SARS-like bat virus SLCoVZXC21/2015, which falls under nHCoV. Furthermore, the AACs and DPCs differed between HCoV and nHCoV for specific amino acids and dipeptides. We believe that these findings could be helpful in understanding the functions of coronavirus proteins, which will be invaluable in designing vaccines.
