
SSSCPreds: Deep Neural Network-Based Software for the Prediction of Conformational Variability and Application to SARS-CoV-2.

Hiroshi Izumi¹, Laurence A Nafie²,³, Rina K Dukor³.

Abstract

Amino acid mutations that improve protein stability and rigidity can accompany increases in binding affinity. Therefore, conserved amino acids located on a protein surface may be successfully targeted by antibodies. The quantitative deep mutational scanning approach is an excellent technique to understand viral evolution, and the obtained data can be utilized to develop a vaccine. However, applying the approach to proteins in general is difficult in terms of cost. To address this need, we report the construction of a deep neural network-based program for sequence-based prediction of supersecondary structure codes (SSSCs), called SSSCPrediction (SSSCPred). Further, to predict conformational flexibility or rigidity in proteins, a comparison program called SSSCPreds, which consists of three deep neural network-based prediction systems (SSSCPred, SSSCPred100, and SSSCPred200), has also been developed. The calculations reported here with these programs show the degree of flexibility of the receptor-binding motif of the SARS-CoV-2 spike protein and the rigidity of the unique motif (SSSC: SSSHSSHHHH) at the S2 subunit, and they provide information independent of the X-ray and Cryo-EM structures. The sequence flexibility/rigidity map of SARS-CoV-2 RBD resembles the sequence-to-phenotype maps of ACE2-binding affinity and expression, which were experimentally obtained by deep mutational scanning. This resemblance suggests that the SSSC sequences predicted identically by the three deep neural network-based systems correlate well with the sequences showing both lower ACE2-binding affinity and lower expression. The combined analysis of predicted and observed SSSCs with keyword-tagged datasets would be helpful in understanding the structural correlation to the examined system.


Year: 2020 PMID: 33283104 PMCID: PMC7687297 DOI: 10.1021/acsomega.0c04472

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

In general, the effects of amino acid mutation on functions such as binding between proteins and expression are correlated.[1] The correlation between expression and binding suggests that mutations that improve stability and rigidity accompany increases in binding affinity.[2] Therefore, conserved amino acids located on the protein surface can be more successfully targeted by antibodies.[1] For this purpose, a quantitative deep mutational scanning approach is an excellent technique to understand viral evolution, and the obtained data can be utilized to develop a vaccine.[1] However, there are approximately 110.3 million nonredundant protein sequences in the RefSeq database,[3,4] and the application of the approach to all of the proteins in general is currently difficult. A deep learning-based prediction of the conformational rigidity may be available as a no-cost alternative. Many methods for sequence-based prediction of secondary and supersecondary structures have been developed in the past several years,[5−13] and many secondary structure prediction methods based on deep learning have also been reported.[14−18] Further, Zhang and co-workers have reported recently that the 3D structure prediction method C-I-TASSER incorporating a deep learning-based contact map prediction can create structural appearances of the full-length proteins.[19] However, the classification and prediction of fine-structured loops other than α-helixes, β-strands, coiled coils,[20−22] and disordered regions[23,24] remain elusive. There currently is no way to evaluate whether a particular protein sequence is flexible with the shape when cryo-electron microscopy (Cryo-EM) or X-ray structure of that sequence is not available as a guide. SSSCPreds, described in this work, is the first, and to date only, program that can simultaneously predict locations of protein flexibility or rigidity and the shapes of those regions with high accuracy. 
It does this by comparing different 3D conformation prediction programs that are based only on protein sequences. The detail of conformations could not be discussed by using only the appearance of a molecular model, but rather a comparison of the observed SSSC sequences with the predicted ones obtained from the examined systems as embodied in SSSCPreds, as described here, would be necessary. In the past decade, a means of identifying and codifying supersecondary structures (supersecondary structure code, SSSC) has been developed by us that uses the concept of Ramachandran plot data[25−27] with ω angles and the specification of positions of torsion angles in a protein. These data are derived from a fuzzy search of structural code homology using template patterns, represented as conformational codes, such as 3a5c4a (α-helix-type conformation) and 6c4a4a (β-sheet-type conformation), to describe supersecondary structural motifs and their conformation.[28,29] The SSSC is transcribed as a conformation propensity using the letters “H”, “S”, “T”, and “D” for each amino acid peptide unit referring to an α-helix-type conformation (H), a β-sheet-type conformation (S), a variety of other-type conformations (T), and disordered residues or the C-terminus (D). This code has been approved as a protocol for a molecular biology database[28] and can be used to distinguish the difference of characteristic loop structures between IgG immunoglobulin (SSSC: SHHSHSS) and IgM rheumatoid factor (SSSC: TTTSSSS).[28,29] On the other hand, interferon α, β, and γ, GroEL, and ubiquitin-associated domains have a unique common structure code motif (SSSC: HHHTTSHHH).[28] Recently, a deep neural network-based program for sequence-based prediction of SSSCs called SSSCPrediction (SSSCPred) was constructed first. 
Then, a comparison program (SSSCPreds that includes SSSCPred) of three deep neural network-based prediction systems (SSSCPred, SSSCPred100, and SSSCPred200) to predict the flexibility and conformational change of proteins was developed. SSSCPred alone does not indicate which part of amino acid sequences is flexible or rigid because additional reference data from SSSCPred100 and SSSCPred200 as contained in SSSCPreds are necessary for comparisons. As we were completing our SSSCPreds comparative analysis of the regions of flexibility of the SARS-CoV-2 protein, independent structure information became available by means of X-ray crystallography and Cryo-EM. This additional information strengthened the conclusions of our flexibility analysis that are based solely on the amino acid sequence data and are not constrained by the structure of the SARS-CoV-2 protein obtained in a single crystal (X-ray) or in a frozen aqueous medium (Cryo-EM). Nevertheless, despite the lack of structural constraint, the result of our SSSCPreds prediction methodology is completely consistent with the reported X-ray and Cryo-EM structures. To develop a vaccine against the coronavirus disease 2019 (COVID-19), which is currently prevalent all over the world, structural information on the virus is required.[30] The protein sequence of severe acute respiratory syndrome coronavirus (SARS-CoV) moderately resembles that of SARS-CoV-2 (about 79% identity).[30] Several observed structures of spike proteins in SARS-CoV[31−36] and SARS-CoV-2,[37−43] including Cryo-EM structures except the full postfusion state S2 proteins, have been registered in the PDB database[44] and are thus available for use in comparing the predicted SSSCs of SARS-CoV-2. On the other hand, the sequence identity of ORF8 (Open Reading Frame 8) proteins between SARS-CoV[45,46] and SARS-CoV-2[47,48] (about 24% identity) is very low. 
Although the structure of SARS-CoV-2 ORF8 has been reported recently,[48] that of SARS-CoV ORF8 has not yet been solved. ORF8 disrupts IFN-I signaling when exogenously overexpressed in cells.[47] Another report states that ORF8 of SARS-CoV-2, but not ORF8 or ORF8a/b of SARS-CoV, downregulates MHC-I in cells.[48] For SARS-CoV, the full 122 amino acid protein encoded by ORF8 induces ATF6-dependent transcription, which triggers the expression of chaperones.[45] A 29-nucleotide deletion (Δ29), which splits SARS-CoV ORF8 into ORF8a and ORF8b, is correlated with milder disease.[48] In this paper, we present, for the first time, a sequence flexibility/rigidity map obtained from a deep learning comparison program, SSSCPreds, that uses input from three structure prediction programs: SSSCPred, SSSCPred100, and SSSCPred200. This sequence flexibility/rigidity map resembles the sequence-to-phenotype maps[1] of the SARS-CoV-2 receptor-binding domain (RBD). As a particularly important and urgent demonstration of SSSCPreds, we assess the flexibility of the SARS-CoV-2 RBD and ORF8 and the rigidity of the nearby S2 region.

Results and Discussion

Translation of Amino Acid Sequences to SSSCs

SSSCPrediction was compared with Quick2D[8] by using the PDB file 1a00_A (HEMOGLOBIN ALPHA CHAIN). As shown in Figure 1, the main difference between SSSCPrediction and Quick2D was found in the structured loop regions: only SSSCPrediction could predict the fine loop conformations. A direct comparison could not be made because the reference (correct-answer) data differ between SSSCPrediction and other prediction methods. Nevertheless, the concordance rates for the translation of amino acid sequences to SSSCs, benchmarked using 612 and 17,169 protein subunits containing at least 100 amino acid residues in the CB513 and CullPDB datasets,[15] were 0.88 and 0.86, respectively.

Figure 1

Comparison of SSSCPrediction with Quick2D.[8] The PDB file (1a00_A) was used for comparison (SSSCPrediction: H, α-helix-type conformation; S, β-sheet-type conformation; T, other-type conformation; D, disordered residue or C-terminus. Quick2D: H, α-helix; E, β-strand; D, disorder).

The average concordance rate for the translation of amino acid sequences to SSSCs using the three test datasets comprising 10,000 FASTA files each was 0.90. A total of 3450 files in the test dataset had a concordance rate of ≥0.95, and 6000 files had a concordance rate of ≥0.90 (Figure 2). In the past three decades, much progress has been made in the development of accurate predictors of the protein secondary structure. Recently, prediction accuracy has increased from about 82 to 84%, which is approaching the estimated upper accuracy limit of around 88%.[5,14−16] As stated above, a direct comparison of accuracy is impossible because the reference (correct-answer) data differ between SSSCPrediction and other prediction methods; even so, these prediction accuracies, or concordance rates, are comparable.

Figure 2

Distribution map of the number of subunits per concordance interval. The average concordance rate of translation of amino acid sequences to SSSCs was 0.90.

The correlation between keywords of subunit names in the training files and concordance rates was examined to understand more about the target subunits for SSSCPrediction. For files containing the keywords PROTEASOME, FAB, LYSOZYME, HEMOGLOBIN, MICROGLOBULIN, HLA, and MYOGLOBIN, the ratio of files with those keywords and a concordance rate of ≥0.90 to the total number of files with those keywords was extremely high (≥0.92; Table 1). In contrast, for files containing the keywords ALKALINE (ratio of files with that keyword and a concordance rate of ≥0.90 to the total number of files with that keyword: 4/97), GLUCOSIDASE (81/245), OUTER MEMBRANE (158/362), ENVELOPE (122/271), PORIN (126/262), REPLICATION (134/271), INTERLEUKIN (265/472), and RIBOSOMAL PROTEIN (1259/1908), the concordance rate was much lower (<0.60); however, these keywords were sometimes found in files with a concordance rate of ≥0.90 and were associated with flexible conformations, and there were no keywords found only in files with a low concordance rate. In the 379,334 files in the overall dataset, the keywords KINASE (4219/6080), TRANSFERASE (3812/6010), SYNTHASE (2868/4159), REDUCTASE (3050/4302), DEHYDROGENASE (2545/3815), HYDROGENASE (2732/4120), POLYMERASE (1863/2888), HYDROLASE (1199/2041), PROTEASE (1344/1765), PHOSPHATASE (990/1690), ISOMERASE (1279/1912), and OXIDASE (1086/1682) frequently appeared, and the ratios of files with those keywords and a concordance rate of ≥0.90 to the total number of files with those keywords ranged from 0.59 to 0.76. Thus, there were no keywords associated only with a low concordance rate, and an identical amino acid sequence of a flexible protein may have different SSSC sequences.

Table 1

Keywords Included in the Training Dataset Files that Afforded High Concordance Rates

keyword	files with a concordance rate of ≥0.90 (A)	total number of files (B)	A/B ratio
PROTEASOME	3283	3551	0.92
FAB	1989	2071	0.96
LYSOZYME	786	830	0.95
HEMOGLOBIN	760	825	0.92
MICROGLOBULIN	501	534	0.94
HLA	402	424	0.95
MYOGLOBIN	174	178	0.98
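
As a quick arithmetic check, the A/B ratios in Table 1 can be recomputed from the two count columns. The sketch below only re-derives the published ratios from the counts listed in the table; the dictionary name `TABLE_1` and the helper `high_concordance_ratio` are illustrative names that do not come from the SSSCPreds software.

```python
# Counts copied from Table 1: keyword -> (files with concordance >= 0.90, total files).
TABLE_1 = {
    "PROTEASOME": (3283, 3551),
    "FAB": (1989, 2071),
    "LYSOZYME": (786, 830),
    "HEMOGLOBIN": (760, 825),
    "MICROGLOBULIN": (501, 534),
    "HLA": (402, 424),
    "MYOGLOBIN": (174, 178),
}


def high_concordance_ratio(high_count, total_count):
    """Return the A/B ratio used in Table 1 (hypothetical helper name)."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return high_count / total_count


# Ratios rounded to two decimals, as reported in the table.
ratios = {
    keyword: round(high_concordance_ratio(a, b), 2)
    for keyword, (a, b) in TABLE_1.items()
}

for keyword, ratio in ratios.items():
    print(f"{keyword}\t{ratio:.2f}")
```

The same two-column layout (A and B counts per keyword) would apply to the lower-ratio keywords quoted in the text, such as ALKALINE (4/97).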

To confirm whether the flexibility and conformational change of proteins can be predicted or not, two additional deep neural network-based prediction systems were constructed by using procedures similar to that used to construct SSSCPrediction (SSSCPred). The benchmarks (average concordance rates) of the three systems were as follows: for SSSCPred200, 0.905 (CullPDB;[15] 9851 subunits) and 0.911 (CB513;[15] 361 subunits); for SSSCPred100, 0.896 (CullPDB; 17,169 subunits) and 0.907 (CB513; 612 subunits); for SSSCPred, 0.861 (CullPDB; 17,169 subunits) and 0.882 (CB513; 612 subunits). For CullPDB files, the total number of files with a concordance rate of <0.65 between SSSCPred200 and PDB data was 66. Of these CullPDB files, the ratio of files with a concordance rate of <0.70 between SSSCPred200 and SSSCPred100 data to the total number of files was 0.83 (see Table S1). For CB513 files, the total number of files with a concordance rate of <0.75 between SSSCPred200 and PDB data was 17. Of these CB513 files, the ratio of files with a concordance rate of <0.80 between SSSCPred200 and SSSCPred100 data to the total number of files was 0.59 (see Table S2). Exceptionally, in the CB513 files, the subunit with the keyword PHOSPHOGLYCERATE MUTASE 1 (3pgm_A) showed the high concordance rate (0.91) between SSSCPred200 and SSSCPred100 data in contrast with the low concordance rate (0.62) between SSSCPred200 and PDB data. In that case, the PDB files (1qhf_A, 1bq3_A, 4pgm_A, and 5pgm_A) of the same keyword with high concordance rates (0.96, 0.97, 0.96, and 0.99) between SSSCPred200 and PDB data were found. This means that the PDB files 3pgm_A and 1qhf_A have the identical amino acid sequence, but the SSSC sequences, which reflect the subunit flexibility, are largely different. For CullPDB files, the total number of files with a concordance rate of <0.65 between SSSCPred200 and SSSCPred100 was 80. 
Of these CullPDB files, the ratio of files with a concordance rate of <0.75 between SSSCPred200 and PDB data to the total number of files was 0.80 (64/80). The magnitude of the concordance rates among the three systems (SSSCPreds) therefore provides a good indication of the flexibility of the protein subunits.
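
The concordance comparisons described above can be illustrated with a short sketch. It assumes that the concordance rate is the fraction of positions at which two equal-length SSSC strings carry the same symbol (H, S, T, or D), which matches the description of "agreements of H, S, T, and D symbols" but is not taken from the SSSCPreds source code. The function names and the toy strings are hypothetical.

```python
def concordance_rate(sssc_a, sssc_b):
    """Fraction of positions where two equal-length SSSC strings agree.

    Assumption: this per-position agreement is what the text calls the
    concordance rate; the actual SSSCPreds implementation may differ.
    """
    if len(sssc_a) != len(sssc_b):
        raise ValueError("SSSC strings must have the same length")
    if not sssc_a:
        raise ValueError("SSSC strings must not be empty")
    matches = sum(a == b for a, b in zip(sssc_a, sssc_b))
    return matches / len(sssc_a)


def pairwise_concordance(predictions):
    """Concordance rate for every pair of named SSSC strings."""
    names = sorted(predictions)
    return {
        (first, second): concordance_rate(predictions[first], predictions[second])
        for i, first in enumerate(names)
        for second in names[i + 1:]
    }


# Toy SSSC strings, invented for illustration only.
toy_predictions = {
    "SSSCPred200": "HHHHSSSS",
    "SSSCPred100": "HHHHSSST",
    "SSSCPred": "HHTTSSST",
}
rates = pairwise_concordance(toy_predictions)
for pair, rate in rates.items():
    print(pair, rate)
```

Under this reading, a low rate between two of the predictors for the same sequence would flag a subunit as a candidate for conformational flexibility, in line with the CullPDB and CB513 comparisons above.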

Predicted and Observed SSSC Sequences of SARS-CoV-2 Proteins

Spike Protein RBD

We then compared the predicted and observed SSSC sequences of the spike proteins of SARS-CoV-2 and SARS-CoV at the receptor-binding domain (Figure 3; see Figure S1 for complete sequences). The SSSC sequences of SARS-CoV predicted by the three deep neural network-based systems reproduced those of the PDB data (6acc_A, 5xlr_A, and 5x58_A) well, including the structured loops. The observed SSSC sequence of the SARS-CoV-2 main protease (6lu7_A) corresponded well to the predicted ones (av. 0.919, see Figure S2). In contrast with the relatively undeformable receptor-binding motif (binding to human ACE2) of SARS-CoV, the corresponding motif of SARS-CoV-2 (aa 437 to 508) indicated the possibility of conformational change between the α-helix and β-strand. This possibility was also supported by a Quick2D analysis, including a series of secondary structure predictions (Figure 3).[8] Indeed, the receptor-binding motif SSSCs of SARS-CoV-2, which contain blanks in the Cryo-EM structure data of the entire SARS-CoV-2 spike protein (6vsb and 6vxx), differ greatly from those of SARS-CoV, with those of SARS-CoV-2 being more flexible (Figure 3). On the other hand, the receptor-binding motif SSSCs of SARS-CoV-2 bound to human ACE2 in the Cryo-EM or X-ray structure data of the partial receptor-binding domain (6m17, 6vw1_E, 6lzg_B, 6m0j_E, and 6w41_C) are very similar to those of SARS-CoV. 
Wrapp and co-workers reported that although spike protein S1 of SARS-CoV-2 binds human ACE2 with higher affinity than that of SARS-CoV, several published SARS-CoV receptor-binding domain-specific monoclonal antibodies do not have appreciable binding to that of SARS-CoV-2.[37] Yuan and co-workers described that a neutralizing antibody previously isolated from a convalescent SARS patient, in complex with the receptor-binding domain of the SARS-CoV-2 spike protein, targets a highly conserved epitope, distal from the receptor-binding site, that enables cross-reactive binding between SARS-CoV-2 and SARS-CoV.[43] The observed SSSC of the highly conserved epitope (6w41_C) resembles that of SSSCPred100, and the gap of the epitope SSSCs among the three systems is smaller than that of the receptor-binding motif SSSCs (underlined SSSCs in Figure 3). This suggests that although binding of the receptor-binding motif to human ACE2 stabilizes the bound conformation, the flexibility of the receptor-binding motif in SARS-CoV-2 prevents appreciable binding of the SARS-CoV receptor-binding motif-specific monoclonal antibodies.

Figure 3

Sequence flexibility/rigidity map of SARS-CoV-2 RBD and SARS-CoV RBD. The identical SSSC sequences among the predicted ones by three deep neural network-based systems and the corresponding observed ones are colored (blue: α-helix-type conformation; red: β-sheet-type conformation; green: a variety of other-type conformations). A comparison of the SSSCPreds result (SARS-CoV-2) with that of Quick2D is also shown.

The sequence flexibility/rigidity map of SARS-CoV-2 RBD resembles the sequence-to-phenotype maps, which were experimentally obtained by deep mutational scanning (Figure 3).[1] This suggests that the SSSC sequences predicted identically by the three deep neural network-based systems (SSSCPred200, SSSCPred100, and SSSCPred) and matching the observed ones correlate well with the sequences of both lower ACE2-binding affinity and lower expression.[1] The sequence flexibility/rigidity map obtained from SSSCPreds would therefore be available for no-cost mutation-site selection of proteins in general.
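
The idea behind the flexibility/rigidity map, namely keeping only the positions where all three predictors agree, can be sketched as follows. This is a minimal illustration under the assumption that agreement is evaluated position by position on equal-length SSSC strings; the function name `rigidity_map`, the gap character, and the example strings are hypothetical and are not part of SSSCPreds.

```python
def rigidity_map(pred200, pred100, pred, gap="-"):
    """Keep the SSSC symbol where all three predictions agree, else a gap.

    Agreement positions are read here as candidates for rigid regions and
    gap positions as candidates for flexible regions (an assumption that
    follows the description of the map, not the original implementation).
    """
    if not (len(pred200) == len(pred100) == len(pred)):
        raise ValueError("all three SSSC strings must have the same length")
    return "".join(
        a if a == b == c else gap
        for a, b, c in zip(pred200, pred100, pred)
    )


def rigid_fraction(consensus, gap="-"):
    """Fraction of positions on which the three predictors agree."""
    if not consensus:
        raise ValueError("consensus string must not be empty")
    return sum(ch != gap for ch in consensus) / len(consensus)


# Toy predictions, invented for illustration only.
consensus = rigidity_map("HHSST", "HHSTT", "HHSSD")
print(consensus)
print(rigid_fraction(consensus))
```

A map built this way could then be laid alongside the observed SSSCs from PDB entries, as is done in the figure.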

Spike Protein S2

The sequence identity of spike protein S2 between SARS-CoV-2 and SARS-CoV (aa 668 to 1255, about 90% identity) was greater than that of S1 (see Figure S1). Only one identical motif (SSSC: SSSHSSHHHH) among all of the compared SSSC sequences, including predicted and observed ones, was found at the S2 subunit (Figure 4). This motif is extremely rare: only 200 subunit files containing the SSSC sequence of the motif exist among all of the 582,666 PDB subunit files (see Figure S3). By comparison, the number of subunits for a commonplace motif (SSSC: SSSHHTSSS) is about 140,000. Even for an already reported common motif (SSSC: SSSHHSHSSS) in antibodies and in major histocompatibility complex class I and II molecules, 34,039 subunits exist.[29] Apart from virus proteins, integrin αL (leukocyte function-associated antigen 1)[49,50] and cell division protein kinase 2 (CDK2),[51] which are involved in cell adhesion and cell division, are the main proteins that have such a relatively undeformable motif (Figure 5). For CDK2 with cyclin A, an adenosine-5′-triphosphate (ATP) molecule interacts with this motif.[52] The SSSC of this motif in the free form of CDK2 (1buh_A)[53] is identical to that in the ATP-binding form (1fin_A).[52] The relatively undeformable motif protrudes on the molecular surface, and the amino acid sequence of the motif for SARS-CoV differs from those of the other proteins (6acc_A: LPPLLTDDMI; 3f74_A: YKTEFDFSDY; 3ig7_A: EFLHQDLKKF). The motif (SSSC: SSSHSSHHHH) was also found in the subunits of reaction center protein H, glyceraldehyde-3-phosphate dehydrogenase, glutamine phosphoribosylpyrophosphate, green fluorescent protein fp512, and insulin-degrading enzyme. Furthermore, another coronavirus, murine hepatitis virus, has the identical motif (3jcl_A).[54] This motif may correlate with proteins recognizing phosphates such as phospholipids. 
Walls and co-workers found that the SARS-CoV-2 S glycoprotein harbors a furin cleavage site (NSPRRAR↓S) at the boundary between the S1/S2 subunits, which is processed during biogenesis and sets this virus apart from SARS-CoV and SARS-related CoVs.[40] Therefore, the relatively undeformable motif at the S2 subunit may be a candidate drug discovery target. The unique motif belongs to the connecting region of the S2 subunit,[33] and the entire structure of the postfusion state S2 subunit of SARS-CoV-2 is not yet available.[31] The predicted conformations of heptad repeat 1 (HR1) and HR2 at the S2 subunit correspond well to the observed ones in the partial structure of the postfusion state (see Figure S1).[55] The SSSC sequence change from “STSHHHSSTSSS” to “SSTTSSSSSSSS” contributes to the transformation to the postfusion hairpin state (Figure ). The SSSCPreds analysis also suggests that the sequence “AGFIKQYGDCLGDIAARDLI” in the connecting region of the S2 subunit easily forms successive stable α-helix-type conformations (Figure ) by the interaction of the fusion peptide (FP) and the unique motif with the host membrane (Figure ). The prediction results near the connecting region of the S2 subunit, including the relatively undeformable motif, would be useful for understanding the conformational change from the prefusion native state to the postfusion hairpin state via the prehairpin intermediate state.
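
Counting how many subunits contain a given SSSC motif, as in the rarity comparison above, amounts to a substring search over SSSC strings. The sketch below is one simple way to do this; the function names and the toy subunit data are hypothetical, and the real counts (for example, 200 of 582,666 subunit files) come from the authors' dataset, not from this code.

```python
def find_motif(sssc, motif):
    """Return 0-based start positions of every occurrence of motif in sssc."""
    hits = []
    start = sssc.find(motif)
    while start != -1:
        hits.append(start)
        # Step by one so that overlapping occurrences are also reported.
        start = sssc.find(motif, start + 1)
    return hits


def motif_residues(amino_acids, sssc, motif):
    """Amino acid segments aligned with each motif occurrence."""
    if len(amino_acids) != len(sssc):
        raise ValueError("sequence and SSSC string must have the same length")
    return [amino_acids[i:i + len(motif)] for i in find_motif(sssc, motif)]


def count_subunits_with_motif(subunit_ssscs, motif):
    """Number of subunits whose SSSC string contains the motif at least once."""
    return sum(1 for sssc in subunit_ssscs.values() if motif in sssc)


MOTIF = "SSSHSSHHHH"

# Toy subunit SSSC strings, invented for illustration only.
toy_subunits = {
    "toy_A": "TTSSSHSSHHHHTT",
    "toy_B": "HHHHHHTTSSSSSS",
    "toy_C": "SSSHSSHHHHSSSHSSHHHH",
}
print(count_subunits_with_motif(toy_subunits, MOTIF))
print(find_motif(toy_subunits["toy_C"], MOTIF))
```

With per-subunit amino acid sequences available, `motif_residues` would return the residues underlying each hit, analogous to the segments quoted above (for example, LPPLLTDDMI for 6acc_A).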

Figure 4

Sequence flexibility/rigidity map of the connecting regions of S2 subunits for SARS-CoV-2 and SARS-CoV. The identical SSSC sequences among the predicted ones by three deep neural network-based systems and the corresponding observed ones are colored (blue: α-helix-type conformation; red: β-sheet-type conformation; green: a variety of other-type conformations). A comparison of the SSSCPreds result (SARS-CoV-2) with that of Quick2D is also shown. UH: upstream helix; L: linker region; FP: fusion peptide; CR: connecting region; HR1: heptad repeat 1.

Figure 5

Common undeformable motifs of subunits in keyword-tagged datasets (blue; SSSC: SSSHSSHHHH). (A) SARS-CoV (6acc, monomer), (B) SARS-CoV (6acc, trimer), (C) integrin αL (3f74), (D) leukocyte function-associated antigen 1 (1zop), and (E) cell division protein kinase 2 (3ig7). The relatively undeformable motif protrudes on the molecular surface.

Figure 6

Cryo-EM structures of SARS-CoV-2 S2. (A) Prefusion native state (6xr8_A) and (B) postfusion hairpin state (6xra_A). The S1 subunit is omitted for clarity. CTD1: C-terminal domain 1; UH: upstream helix; L: linker region; FP: fusion peptide; CR: connecting region; HR1: heptad repeat 1; CH: central helix; BH: β-hairpin; SD3: subdomain 3; HR2: heptad repeat 2.

ORF8 Protein

The sequence identity of ORF8 proteins between SARS-CoV[45,46] and SARS-CoV-2[47,48] (about 24% identity) is very low, and the function is under debate. The training datasets contain no data related to ORF8 SSSC sequences. As shown in Figure 7, the concordance rate of SSSCs between SSSCPred100 and SSSCPred data for SARS-CoV-2 ORF8 (0.75) is larger than that for SARS-CoV ORF8a/b (0.64). The predicted SSSCs of SARS-CoV-2 ORF8, which has many stable β-strands forming an immunoglobulin-like fold, correspond well to the observed ones (Figure 7). It is suggested that three sets of intramolecular disulfide bonds per monomer immobilize the thermodynamically unstable conformations (Figure 8), especially one of those in the motif of the amino acid sequence “YTVSCLPFT” (SSSC: HSHSHTSSS).[48] Furthermore, the predicted SSSCs are consistent with the view that the flexibility of the β3 motif of SARS-CoV ORF8a/b interferes with crystallization (Figure ).

Figure 7

Sequence flexibility/rigidity map of SARS-CoV-2 ORF8, SARS-CoV ORF8a/b, and SARS-CoV GZ02 ORF8. The identical SSSC sequences between SSSCPred100 and SSSCPred and the corresponding observed ones are colored (blue: α-helix-type conformation; red: β-sheet-type conformation; green: a variety of other-type conformations). A comparison of the SSSCPreds result (SARS-CoV-2) with that of Quick2D is also shown.

Figure 8

Crystal structure of SARS-CoV-2 ORF8. The motif (green; SSSC: HSHSHTSSS) is immobilized by the disulfide bond (7jtl_A). The predicted SSSCs suggest that the rare motif (red; SSSC: SSSHSHHTHSS) and the β3 motif of SARS-CoV ORF8a/b are flexible.

As shown in Figure 8, a very rare motif (SSSC: SSSHSHHTHSS) with the amino acid sequence “RCSFYEDFLEY” was found in the observed SSSCs of SARS-CoV-2 (7jtl_A). A total of 1372 subunit files containing the SSSC sequence exist among all of the 582,666 PDB subunit files, and the main subunits are endothiapepsin, DNA damage-binding protein 1, isocitrate dehydrogenase [NADP] cytoplasmic, DNA-directed RNA polymerase subunit α, 70 kDa peptidylprolyl isomerase, purple acid phosphatase, α-1,6-mannanase, interferon regulatory factor 4, β-fructofuranosidase, ethanolamine utilization protein eutl, crispr-associated helicase, hemagglutinin-neuraminidase glycoprotein, plasminogen, sigma A, and catalase-peroxidase 2 (see Figure S4). A 2-(N-morpholino)-ethanesulfonic acid molecule interacts with this motif in a subunit of DNA damage-binding protein 1 (4a08_A). Although the predicted SSSCs suggest that the motif is flexible, the motif may correlate with the regulation of phosphorylated substances such as DNA, RNA, and phosphorylated proteins and may contribute to the difference that ORF8 of SARS-CoV-2, but not ORF8 or ORF8a/b of SARS-CoV, downregulates MHC-I in cells.[48]

Conclusions

The deep neural network-based program for sequence-based prediction of SSSCs (SSSCPrediction) and the comparison program (SSSCPreds) of three deep neural network-based prediction systems (SSSCPred200, SSSCPred100, and SSSCPred) to predict the flexibility and conformational change of proteins were constructed. The degree of flexibility for the receptor-binding motif of SARS-CoV-2 spike protein and the rigidity of the unique motif at the S2 subunit cannot be found only from the Cryo-EM and X-ray structures. This methodology provides a verified path to analyze other protein structures in a similar way when there may not be X-ray or Cryo-EM structures available such as SARS-CoV ORF8a/b. The sequence flexibility/rigidity map obtained from the combined analysis of predicted and observed SSSCs with keyword-tagged datasets would be useful to understand viral evolution.

Computational Methods

Dataset

A total of 582,813 FASTA-format files containing the amino acid sequences and SSSCs of protein subunits were extracted from 139,932 PDB files[44] by using the SSSCview program (available online at https://staff.aist.go.jp/izumi.h/SSSCPreds/index-e.html).[28] Of these FASTA files, 379,334 files containing subunits with more than or equal to 100 continuous amino acid residues were extracted, and from those files, 150,000 files as training data for the deep neural network, 10,000 files as test data for the deep neural network, and three sets of 10,000 files as test data for the inference system were randomly selected. From each FASTA file, a set of 100 continuous amino acid residues and the corresponding SSSC were randomly extracted. SSSC terms H, S, T, and D were converted to [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively, and a set of matrices (100, 4) was constructed. The amino acid sequence was also similarly converted.[56] The dataset for the deep neural network was prepared by using Python.[57]
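
The encoding step described above can be sketched in a few lines of Python. The SSSC mapping (H, S, T, D to the four unit vectors) and the 100-residue window follow the text; the amino acid alphabet of 20 standard residues, the all-zero row for unknown symbols, and the function names are assumptions made for illustration, since the text states only that the amino acid sequence was "similarly converted".

```python
import random

import numpy as np

SSSC_LETTERS = "HSTD"  # H -> [1,0,0,0], S -> [0,1,0,0], T -> [0,0,1,0], D -> [0,0,0,1]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues


def one_hot(sequence, alphabet):
    """Return a (len(sequence), len(alphabet)) matrix of 0/1 values.

    Symbols outside the alphabet give an all-zero row (an assumption;
    the original handling of nonstandard residues is not described).
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = np.zeros((len(sequence), len(alphabet)), dtype=np.float32)
    for position, symbol in enumerate(sequence):
        if symbol in index:
            matrix[position, index[symbol]] = 1.0
    return matrix


def random_window(amino_acids, sssc, length=100, rng=random):
    """Pick one random stretch of `length` continuous residues and its SSSC."""
    if len(amino_acids) != len(sssc):
        raise ValueError("sequence and SSSC string must have the same length")
    if len(amino_acids) < length:
        raise ValueError("sequence is shorter than the window length")
    start = rng.randrange(len(amino_acids) - length + 1)
    return amino_acids[start:start + length], sssc[start:start + length]


# Toy sequence and SSSC string, invented for illustration only.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 6
toy_sssc = "HHHHSSSSTTDD" * 10
window_aa, window_sssc = random_window(toy_sequence, toy_sssc, length=100)
x = one_hot(window_aa, AMINO_ACIDS)    # input matrix
y = one_hot(window_sssc, SSSC_LETTERS)  # target matrix
print(x.shape, y.shape)
```

With a window length of 200, the same helpers would give the (200, 4) target matrices used for SSSCPred200.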

SSSCPrediction

Deep learning for the prediction of SSSCs from amino acid sequences was performed by using Neural Network Console 1.40 (https://dl.sony.com/app/). Neural Network Console is convenient artificial intelligence (AI) software that can automatically optimize trained networks using a Gaussian process. The revised template of network "12_residual_learning.sdcproj" for the standard MNIST (Modified National Institute of Standards and Technology) dataset was used to provide the initial structure of the deep neural network, which was then trained with our prepared training dataset. The obtained trained network is shown in Figure 9 (activation function: ReLU; cost function: HuberLoss; max epoch: 20; batch size: 64; precision: float; structure search: Network Feature + Gaussian Process; updater: Adam; update interval: 1 iteration; alpha: 0.001; beta1: 0.9; beta2: 0.999; epsilon: 1E-8). The obtained network and parameters were introduced to the SSSCPrediction inference system, and the system was set to examine amino acid sequences containing at least 100 amino acid residues. For each amino acid sequence, SSSC terms were predicted for 50 continuous amino acid residues and for the initial and final 100 amino acid residues in the sequence. Then, the first 70 SSSC terms in the sequence were selected, followed by 50 SSSC terms; any remaining SSSC terms at the end of the sequence were also selected. The other three prepared sets of 10,000 test data files for the SSSCPrediction inference system were then used to evaluate the concordance rate using agreements of H, S, T, and D symbols.
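
The selection of SSSC terms from overlapping windows is described only briefly, so the following sketch shows one possible reading rather than the actual inference code. It assumes that a 100-residue window is moved along the sequence in steps of 50 residues, that the first 70 terms are taken from the first window, that each later window contributes the next 50 terms, and that the window covering the final 100 residues supplies whatever remains. The function name `stitch_predictions`, its parameters, and the stand-in predictor `toy_predictor` are all hypothetical.

```python
def stitch_predictions(sequence, predict_window, window=100, lead=70, step=50):
    """Assemble a full-length SSSC string from fixed-size window predictions.

    predict_window takes a string of exactly `window` residues and returns
    an SSSC string of the same length (here a stand-in for the trained
    network). The stitching scheme is an interpretation of the text.
    """
    n = len(sequence)
    if n < window:
        raise ValueError("sequence must contain at least `window` residues")
    # Inside every later window, the kept slice starts at this offset.
    offset = lead - step
    stitched = predict_window(sequence[:window])[:lead]
    start = step
    while start + window <= n:
        prediction = predict_window(sequence[start:start + window])
        stitched += prediction[offset:offset + step]
        start += step
    # The window over the final `window` residues fills in the remainder.
    tail = predict_window(sequence[n - window:])
    stitched += tail[len(stitched) - (n - window):]
    return stitched


def toy_predictor(residues):
    """Stand-in predictor: a fixed per-residue rule, not a neural network."""
    return "".join("H" if residue in "AELM" else "S" for residue in residues)


toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 12  # 240 residues, invented
result = stitch_predictions(toy_sequence, toy_predictor)
print(len(result))
```

Under the same reading, the SSSCPred200 settings would correspond to `window=200, lead=125, step=50`.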

Figure 9

Network architectures of SSSCPreds. (A) SSSCPred, (B) SSSCPred100, and (C) SSSCPred200.

Comparison of SSSCPrediction with Quick2D[8] was carried out by using the amino acid sequence of a PDB file (1a00_A: HEMOGLOBIN ALPHA CHAIN). The method was benchmarked by using 612 and 17,169 protein subunits containing at least 100 amino acid residues in the CB513 and CullPDB datasets, respectively.[15] CB513 is a nonredundant dataset created for developing and training algorithms, including neural networks, that predict the secondary protein structure. The CullPDB dataset is a large nonhomologous sequence set produced by using the PISCES server, which culls subsets of protein sequences from the Protein Data Bank based on sequence identity and structural quality criteria. The 150,000 training data files and the 10,000 test data files for the prediction of SSSCs from amino acid sequences using the deep neural network were also tested to evaluate the concordance rate.

Comparison Program (SSSCPreds) of Three Deep Neural Network-Based Prediction Systems

Two additional deep neural network-based prediction systems were constructed by using procedures similar to those used to construct SSSCPrediction (SSSCPred). As before, a total of 582,666 FASTA-format files containing the amino acid sequences and SSSCs of protein subunits were extracted from 139,932 PDB files[44] by using the SSSCview program.[28] Of these FASTA files, 207,738 files containing subunits with at least 200 continuous amino acid residues were extracted. From those files, 150,000 files as training data for the deep neural network, 10,000 files as test data for the deep neural network, and 10,000 files as test data for the inference system were randomly selected for SSSCPred200. From each FASTA file, a set of 200 continuous amino acid residues and the corresponding SSSC were randomly extracted. SSSC terms H, S, T, and D were converted as before to [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively, and a set of matrices (200, 4) was constructed. The amino acid sequence was also similarly converted. Deep learning for the prediction of SSSCs from amino acid sequences was performed by using Neural Network Console 1.40 (https://dl.sony.com/app/). The revised network template 12_residual_learning.sdcproj for the standard MNIST dataset was used to provide the initial structure of the deep neural network, which was then trained with the prepared training dataset. The obtained network (Figure 9) and parameters were introduced to the SSSCPred200 inference system, and the system was set to examine amino acid sequences containing at least 200 amino acid residues. For each amino acid sequence, SSSC terms were predicted for 50 continuous amino acid residues and for the initial and final 200 amino acid residues in the sequence. Then, the first 125 SSSC terms in the sequence were selected, followed by 50 SSSC terms; any remaining SSSC terms at the end of the sequence were also selected.
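
The data preparation described above (random extraction of 200 continuous residues and one-hot conversion of the SSSC terms) can be sketched in Python as follows. The H, S, T, and D vectors follow the text. The 20-letter amino acid alphabet, the all-zero row for unlisted symbols, and the helper names `one_hot` and `sample_window` are assumptions made for illustration only.

```python
import random
from typing import List, Optional, Tuple

SSSC_SYMBOLS = "HSTD"                 # order given in the text
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 standard residues


def one_hot(text: str, alphabet: str) -> List[List[int]]:
    """Convert each symbol to a one-hot row; unlisted symbols become all zeros."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in text:
        row = [0] * len(alphabet)
        if symbol in index:
            row[index[symbol]] = 1
        rows.append(row)
    return rows


def sample_window(
    sequence: str,
    sssc: str,
    length: int = 200,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Randomly extract `length` continuous residues and the matching SSSC."""
    if len(sequence) != len(sssc):
        raise ValueError("sequence and SSSC must have equal length")
    if len(sequence) < length:
        raise ValueError(f"subunit must contain at least {length} residues")
    rng = rng or random.Random()
    start = rng.randrange(len(sequence) - length + 1)
    return sequence[start:start + length], sssc[start:start + length]
```

Applying `one_hot(window_sssc, SSSC_SYMBOLS)` to a 200-term window gives a list of 200 rows with 4 entries each, which corresponds to the (200, 4) matrix mentioned in the text.
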
The three prediction programs for SSSCPreds (SSSCPred200, SSSCPred100, and SSSCPred) were obtained as follows. Training data of 200 continuous amino acid residues and 150,000 subunits were used to construct SSSCPred200, those of 100 continuous amino acid residues and 350,000 subunits were used for SSSCPred100, and those of 100 continuous amino acid residues and 150,000 subunits were used for SSSCPred. The sequence sampling range of SSSCPred200, SSSCPred100, and SSSCPred is considered to be optimal for the prediction of protein stability and rigidity from the PDB sequences described above. Longer windows, such as SSSCPred300, would be limited to the far fewer available sequences of that length. Shorter windows, such as SSSCPred50 or SSSCPred20, would involve peptide sequences that are heavily influenced by solution environment factors; accounting for those factors would carry the correlations into a domain beyond the three programs SSSCPred200, SSSCPred100, and SSSCPred, which are described here for stable protein sequences. SSSCPreds is available as a standalone program at https://staff.aist.go.jp/izumi.h/SSSCPreds/index-e.html.
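
This section does not describe how SSSCPreds combines the three outputs. The Python sketch below is one simple way to compare them, assuming that positions where all three predicted SSSC strings are identical are read as conformationally rigid and all other positions as potentially flexible. The function name `agreement_map` and the `-` marker are hypothetical.

```python
def agreement_map(pred: str, pred100: str, pred200: str, marker: str = "-") -> str:
    """Keep the SSSC symbol where all three predictions agree, else `marker`.

    Assumption: agreement of the three systems is used as a proxy for
    rigidity; SSSCPreds itself may report the comparison differently.
    """
    if not (len(pred) == len(pred100) == len(pred200)):
        raise ValueError("all three predictions must have equal length")
    return "".join(
        a if a == b == c else marker
        for a, b, c in zip(pred, pred100, pred200)
    )
```

In this sketch, a run of unchanged symbols in the returned string marks a segment on which the three systems agree, and a run of markers points to a segment where the predictions diverge.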
