Hiroshi Izumi1, Laurence A Nafie2,3, Rina K Dukor3. 1. National Institute of Advanced Industrial Science and Technology (AIST), AIST Tsukuba West, 16-1 Onogawa, Tsukuba, Ibaraki 305-8569, Japan. 2. Department of Chemistry, Syracuse University, Syracuse, New York 13244-4100, United States. 3. BioTools Inc., 17546 SR 710 (Bee Line Hwy), Jupiter, Florida 33458, United States.
Abstract
Amino acid mutations that improve protein stability and rigidity can accompany increases in binding affinity. Therefore, conserved amino acids located on a protein surface may be successfully targeted by antibodies. The quantitative deep mutational scanning approach is an excellent technique to understand viral evolution, and the obtained data can be utilized to develop a vaccine. However, applying the approach to proteins in general is difficult in terms of cost. To address this need, we report the construction of a deep neural network-based program for sequence-based prediction of supersecondary structure codes (SSSCs), called SSSCPrediction (SSSCPred). Further, to predict conformational flexibility or rigidity in proteins, a comparison program called SSSCPreds that consists of three deep neural network-based prediction systems (SSSCPred, SSSCPred100, and SSSCPred200) has also been developed. The analysis with these algorithms shows the degree of flexibility of the receptor-binding motif of the SARS-CoV-2 spike protein and the rigidity of the unique motif (SSSC: SSSHSSHHHH) at the S2 subunit, and it provides information that is independent of the X-ray and Cryo-EM structures. The sequence flexibility/rigidity map of the SARS-CoV-2 RBD resembles the sequence-to-phenotype maps of ACE2-binding affinity and expression, which were experimentally obtained by deep mutational scanning. This suggests that the SSSC sequences that are identical among the predictions of the three deep neural network-based systems correlate well with the sequences with both lower ACE2-binding affinity and lower expression. The combined analysis of predicted and observed SSSCs with keyword-tagged datasets would be helpful in understanding the structural correlation to the examined system.
In
general, the effects of amino acid mutation on functions such
as binding between proteins and expression are correlated.[1] The correlation between expression and binding
suggests that mutations that improve stability and rigidity accompany
increases in binding affinity.[2] Therefore,
conserved amino acids located on the protein surface can be more successfully
targeted by antibodies.[1] For this purpose,
a quantitative deep mutational scanning approach is an excellent technique
to understand viral evolution, and the obtained data can be utilized
to develop a vaccine.[1] However, there are
approximately 110.3 million nonredundant protein sequences in the
RefSeq database,[3,4] and the application of the approach
to all of the proteins in general is currently difficult. A deep learning-based
prediction of the conformational rigidity may be available as a no-cost
alternative. Many methods for sequence-based prediction of secondary
and supersecondary structures have been developed in the past several
years,[5−13] and many secondary structure prediction methods based on deep learning
have also been reported.[14−18] Further, Zhang and co-workers have reported recently that the 3D
structure prediction method C-I-TASSER incorporating a deep learning-based
contact map prediction can create structural appearances of the full-length
proteins.[19] However, the classification
and prediction of fine-structured loops other than α-helixes,
β-strands, coiled coils,[20−22] and disordered regions[23,24] remain elusive. There is currently no way to evaluate whether a
particular protein sequence is flexible, and in what shape, when no cryo-electron
microscopy (Cryo-EM) or X-ray structure of that sequence is available
as a guide. SSSCPreds, described in this work, is the first, and to
date only, program that can simultaneously predict locations of protein
flexibility or rigidity and the shapes of those regions with high
accuracy. It does this by comparing different 3D conformation prediction
programs that are based only on protein sequences. The details of conformations
cannot be discussed from the appearance of a molecular
model alone; rather, a comparison of the observed SSSC sequences with
the predicted ones obtained from the examined systems, as embodied
in SSSCPreds and described here, is necessary.
In the
past decade, a means of identifying and codifying supersecondary
structures (supersecondary structure code, SSSC) has been developed
by us that uses the concept of Ramachandran plot data[25−27] with ω angles and the specification of positions of torsion
angles in a protein. These data are derived from a fuzzy search of
structural code homology using template patterns, represented as conformational
codes, such as 3a5c4a (α-helix-type conformation) and 6c4a4a
(β-sheet-type conformation), to describe supersecondary structural
motifs and their conformation.[28,29] The SSSC is transcribed
as a conformation propensity using the letters “H”,
“S”, “T”, and “D” for each
amino acid peptide unit referring to an α-helix-type conformation
(H), a β-sheet-type conformation (S), a variety of other-type
conformations (T), and disordered residues or the C-terminus (D).
This code has been approved as a protocol for a molecular biology
database[28] and can be used to distinguish
the difference of characteristic loop structures between IgG immunoglobulin
(SSSC: SHHSHSS) and IgM rheumatoid factor (SSSC: TTTSSSS).[28,29] On the other hand, interferon α, β, and γ, GroEL,
and ubiquitin-associated domains have a unique common structure code
motif (SSSC: HHHTTSHHH).[28]
Recently,
a deep neural network-based program for sequence-based
prediction of SSSCs, called SSSCPrediction (SSSCPred), was constructed
first. Then, a comparison program (SSSCPreds, which includes SSSCPred)
of three deep neural network-based prediction systems (SSSCPred, SSSCPred100,
and SSSCPred200) to predict the flexibility and conformational change
of proteins was developed. SSSCPred alone does not indicate which
part of amino acid sequences is flexible or rigid because additional
reference data from SSSCPred100 and SSSCPred200 as contained in SSSCPreds
are necessary for comparisons. As we were completing our SSSCPreds
comparative analysis of the regions of flexibility of the SARS-CoV-2
protein, independent structure information became available by means
of X-ray crystallography and Cryo-EM. This additional information
strengthened the conclusions of our flexibility analysis that are
based solely on the amino acid sequence data and are not constrained
by the structure of the SARS-CoV-2 protein obtained in a single crystal
(X-ray) or in a frozen aqueous medium (Cryo-EM). Nevertheless, despite
the lack of structural constraint, the result of our SSSCPreds prediction
methodology is completely consistent with the reported X-ray and Cryo-EM
structures.
To develop a vaccine against the coronavirus disease
2019 (COVID-19),
which is currently prevalent all over the world, structural information
on the virus is required.[30] The protein
sequence of severe acute respiratory syndrome coronavirus (SARS-CoV)
moderately resembles that of SARS-CoV-2 (about 79% identity).[30] Several observed structures of spike proteins
in SARS-CoV[31−36] and SARS-CoV-2,[37−43] including Cryo-EM structures except the full postfusion state S2
proteins, have been registered in the PDB database[44] and are thus available for use in comparing the predicted
SSSCs of SARS-CoV-2. On the other hand, the sequence identity of ORF8
(Open Reading Frame 8) proteins between SARS-CoV[45,46] and SARS-CoV-2[47,48] (about 24% identity) is very
low. Although the structure of SARS-CoV-2 ORF8 has been reported recently,[48] that of SARS-CoV ORF8 has not
yet been solved. ORF8 disrupts IFN-I signaling when exogenously overexpressed
in cells.[47] It has also been reported that ORF8
of SARS-CoV-2, but not ORF8 or ORF8a/b of SARS-CoV, downregulates
MHC-I in cells.[48] For
SARS-CoV, the full 122 amino acid protein encoded by ORF8 induces
ATF6-dependent transcription, which triggers the expression of chaperones.[45] A 29-nucleotide deletion (Δ29), which splits
SARS-CoV ORF8 into ORF8a and ORF8b, is correlated with milder disease.[48]
In this paper, we present, for the first
time, a sequence flexibility/rigidity
map obtained from a deep learning comparison program, SSSCPreds,
that uses input from three structure prediction programs: SSSCPred,
SSSCPred100, and SSSCPred200. This sequence flexibility/rigidity map
resembles the sequence-to-phenotype maps[1] of the SARS-CoV-2 receptor-binding domain (RBD). As a particularly
important and urgent demonstration of SSSCPreds, we assess the flexibility
of the SARS-CoV-2 RBD and ORF8 and the rigidity of the nearby S2 region.
Results and Discussion
Translation of Amino Acid Sequences to SSSCs
The comparison
of SSSCPrediction with Quick2D[8] was carried
out by using the PDB file (1a00_A: HEMOGLOBIN ALPHA CHAIN). As shown
in Figure , the main
difference between SSSCPrediction and Quick2D was found in the structured
loop regions. Only SSSCPrediction could predict the fine loop conformations.
Although a direct comparison could not be made because the reference
(correct-answer) data differ between SSSCPrediction and other prediction methods,
the concordance rates for the translation of amino acid sequences
to SSSCs using 612 and 17,169 protein subunits containing at least
100 amino acid residues in the CB513 and CullPDB datasets[15] for the benchmark of SSSCPrediction were 0.88
and 0.86, respectively.
Figure 1
Comparison of SSSCPrediction with Quick2D.[8] The PDB file (1a00_A) was used for comparison
(SSSCPrediction: H,
α-helix-type conformation; S, β-sheet-type conformation;
T, other-type conformation; D, disordered residue or C-terminus. Quick2D:
H, α-helix; E, β-strand; D, disorder).
The average concordance rate for the translation of amino
acid
sequences to SSSCs using the three test datasets comprising 10,000
FASTA files each was 0.90. A total of 3450 files in the test dataset
had a concordance rate of ≥0.95, and 6000 files had a concordance
rate of ≥0.90 (Figure ). In the past three decades, much progress has been made
in the development of accurate predictors of the protein secondary
structure. Recently, prediction accuracy has increased from about
82 to 84%, which is approaching the estimated upper accuracy limit
of around 88%.[5,14−16] As stated
above, although a direct comparison of accuracy is impossible because
the reference data differ between SSSCPrediction and other
prediction methods, these prediction accuracies, or concordance rates,
are comparable.
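The concordance rate used throughout this section can be illustrated with a minimal Python sketch. It assumes that the rate is the fraction of residue positions at which a predicted SSSC string matches the observed one; the text does not spell out the formula, so this definition is an assumption. The function names and the SSSC strings below are hypothetical and are not taken from the datasets.

```python
def concordance_rate(predicted: str, observed: str) -> float:
    """Fraction of positions at which two equal-length SSSC strings agree.

    Assumption: position-wise agreement is what "concordance rate" means here.
    """
    if len(predicted) != len(observed):
        raise ValueError("SSSC strings must have the same length")
    if not predicted:
        raise ValueError("SSSC strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


def count_at_or_above(rates, threshold):
    """Number of concordance rates that are >= threshold."""
    return sum(1 for r in rates if r >= threshold)


# Hypothetical SSSC strings (illustrative only).
predicted = "SSSHSSHHHH"
observed = "SSSHSSHHTT"
rate = concordance_rate(predicted, observed)  # 8 of 10 positions agree

# Hypothetical per-file rates, counted against the 0.90 threshold.
rates = [rate, 1.0, 0.95, 0.5]
n_high = count_at_or_above(rates, 0.90)
print(rate, n_high)
```

Counting files at or above a threshold in this way corresponds to the ≥0.95 and ≥0.90 tallies quoted above.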
Figure 2
Distribution map of the number of subunits per concordance
interval.
The average concordance rate of translation of amino acid sequences
to SSSCs was 0.90.
The correlation between
keywords of subunit names in the training
files and concordance rates was examined to understand more about
the target subunits for SSSCPrediction. For files containing the keywords
PROTEASOME, FAB, LYSOZYME, HEMOGLOBIN, MICROGLOBULIN, HLA, and MYOGLOBIN,
the ratio of files with those keywords and a concordance rate of ≥0.90
to the total number of files with those keywords was extremely high
(≥0.92; Table ). In contrast, for files containing the keywords ALKALINE (ratio
of files with that keyword and a concordance rate of ≥0.90
to the total number of files with that keyword: 4/97), GLUCOSIDASE
(81/245), OUTER MEMBRANE (158/362), ENVELOPE (122/271), PORIN (126/262),
REPLICATION (134/271), INTERLEUKIN (265/472), and RIBOSOMAL PROTEIN
(1259/1908), the concordance rate was much lower (<0.60); however,
these keywords were sometimes found in files with a concordance rate
of ≥0.90 and were associated with flexible conformations, and
there were no keywords found only in files with a low concordance
rate. In the 379,334 files in the overall dataset, the keywords KINASE
(4219/6080), TRANSFERASE (3812/6010), SYNTHASE (2868/4159), REDUCTASE
(3050/4302), DEHYDROGENASE (2545/3815), HYDROGENASE (2732/4120), POLYMERASE
(1863/2888), HYDROLASE (1199/2041), PROTEASE (1344/1765), PHOSPHATASE
(990/1690), ISOMERASE (1279/1912), and OXIDASE (1086/1682) frequently
appeared, and the ratios of files with those keywords and a concordance
rate of ≥0.90 to the total number of files with those keywords
ranged from 0.59 to 0.76. Thus, there were no keywords associated
only with a low concordance rate, and an identical amino acid sequence
of a flexible protein may possibly have different SSSC sequences.
Table 1
Keywords Included in the Training Dataset Files that Afforded High Concordance Rates
keyword | files with a concordance rate of ≥0.90 (A) | total number of files (B) | A/B ratio
PROTEASOME | 3283 | 3551 | 0.92
FAB | 1989 | 2071 | 0.96
LYSOZYME | 786 | 830 | 0.95
HEMOGLOBIN | 760 | 825 | 0.92
MICROGLOBULIN | 501 | 534 | 0.94
HLA | 402 | 424 | 0.95
MYOGLOBIN | 174 | 178 | 0.98
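The A/B ratios in Table 1 can be reproduced in principle with a short Python sketch like the one below. It assumes that each file is represented by a (subunit name, concordance rate) pair and that a keyword matches when it appears as a substring of the upper-cased subunit name; both the record layout and the matching rule are assumptions, and the records shown are hypothetical.

```python
def keyword_ratio(records, keyword, threshold=0.90):
    """Return (A, B, A/B) for subunit names containing `keyword`.

    A: files with the keyword and a concordance rate >= threshold.
    B: all files with the keyword.
    """
    rates = [rate for name, rate in records if keyword in name.upper()]
    total = len(rates)
    high = sum(1 for rate in rates if rate >= threshold)
    ratio = high / total if total else float("nan")
    return high, total, ratio


# Hypothetical (subunit name, concordance rate) records.
records = [
    ("HEMOGLOBIN ALPHA CHAIN", 0.97),
    ("HEMOGLOBIN BETA CHAIN", 0.93),
    ("HEMOGLOBIN", 0.81),
    ("LYSOZYME", 0.95),
]

a, b, ratio = keyword_ratio(records, "HEMOGLOBIN")
print(a, b, round(ratio, 2))
```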
To confirm whether the flexibility
and conformational change of
proteins can be predicted or not, two additional deep neural network-based
prediction systems were constructed by using procedures similar to
that used to construct SSSCPrediction (SSSCPred). The benchmarks (average
concordance rates) of the three systems were as follows: for SSSCPred200,
0.905 (CullPDB;[15] 9851 subunits) and 0.911
(CB513;[15] 361 subunits); for SSSCPred100,
0.896 (CullPDB; 17,169 subunits) and 0.907 (CB513; 612 subunits);
for SSSCPred, 0.861 (CullPDB; 17,169 subunits) and 0.882 (CB513; 612
subunits). For CullPDB files, the total number of files with a concordance
rate of <0.65 between SSSCPred200 and PDB data was 66. Of these
CullPDB files, the ratio of files with a concordance rate of <0.70
between SSSCPred200 and SSSCPred100 data to the total number of files
was 0.83 (see Table S1). For CB513 files,
the total number of files with a concordance rate of <0.75 between
SSSCPred200 and PDB data was 17. Of these CB513 files, the ratio of
files with a concordance rate of <0.80 between SSSCPred200 and
SSSCPred100 data to the total number of files was 0.59 (see Table S2). Exceptionally, in the CB513 files,
the subunit with the keyword PHOSPHOGLYCERATE MUTASE 1 (3pgm_A) showed
a high concordance rate (0.91) between SSSCPred200 and SSSCPred100
data in contrast with a low concordance rate (0.62) between SSSCPred200
and PDB data. In that case, the PDB files (1qhf_A, 1bq3_A, 4pgm_A,
and 5pgm_A) of the same keyword with high concordance rates (0.96,
0.97, 0.96, and 0.99) between SSSCPred200 and PDB data were found.
This means that the PDB files 3pgm_A and 1qhf_A have an identical
amino acid sequence, but their SSSC sequences, which reflect the subunit
flexibility, are largely different. For CullPDB files, the total number
of files with a concordance rate of <0.65 between SSSCPred200 and
SSSCPred100 was 80. Of these CullPDB files, the ratio of files with
a concordance rate of <0.75 between SSSCPred200 and PDB data to
the total number of files was 0.80 (64/80). The magnitude of the concordance
rates among the three systems (SSSCPreds) therefore provides a good indication
of the flexibility of the protein subunits.
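As a rough illustration of how concordance among the three systems might be inspected, the following Python sketch computes the pairwise position-wise concordance between predictions. It is not the SSSCPreds implementation: the position-wise definition of concordance, the function names, and the SSSC strings are all assumptions made for illustration.

```python
from itertools import combinations


def concordance_rate(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length SSSC strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("SSSC strings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_concordance(predictions: dict) -> dict:
    """Map each pair of system names to the concordance of their predictions."""
    return {
        (n1, n2): concordance_rate(predictions[n1], predictions[n2])
        for n1, n2 in combinations(sorted(predictions), 2)
    }


# Hypothetical predictions for one short stretch of a subunit.
predictions = {
    "SSSCPred": "SSSHHHTTSS",
    "SSSCPred100": "SSSHHHTSSS",
    "SSSCPred200": "SSSHHHTSSS",
}

pair_rates = pairwise_concordance(predictions)
lowest = min(pair_rates.values())  # a low value would hint at flexibility
print(pair_rates, lowest)
```

Under this reading, a subunit whose lowest pairwise rate is small would be flagged as potentially flexible, which mirrors the thresholds (<0.65 to <0.80) discussed above.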
Predicted and Observed SSSC Sequences of SARS-CoV-2 Proteins
Spike Protein RBD
We then compared the predicted and
observed SSSC sequences of spike proteins of SARS-CoV-2 and SARS-CoV
at the receptor-binding domain (Figure ; see Figure S1 for complete
sequences). The SSSC sequences of SARS-CoV predicted by the three
deep neural network-based systems reproduced those of the PDB
data (6acc_A,
5xlr_A, and 5x58_A) well, including the structured loops. The observed
SSSC sequence of SARS-CoV-2 main protease (6lu7_A) corresponded well
to the predicted ones (av. 0.919, see Figure S2). In contrast with the relatively undeformable receptor-binding
motif (binding to human ACE2) of SARS-CoV, the corresponding motif
of SARS-CoV-2 (aa 437 to 508) indicated the possibility of conformational
change between the α-helix and β-strand. This possibility
was also supported by a Quick2D analysis, including a series of secondary
structure predictions (Figure ).[8] Indeed, the receptor-binding
motif SSSCs of SARS-CoV-2 with blanks for the Cryo-EM structure data
of the entire SARS-CoV-2 spike protein (6vsb and 6vxx) differ greatly from those of SARS-CoV,
with those of SARS-CoV-2 being more flexible (Figure ). On the other hand, the receptor-binding
motif SSSCs of SARS-CoV-2 bound to human ACE2 for the Cryo-EM
or X-ray structure data of the partial receptor-binding domain (6m17, 6vw1_E, 6lzg_B,
6m0j_E, and 6w41_C) are very similar to those of SARS-CoV. Wrapp and
co-workers reported that although spike protein S1 of SARS-CoV-2 binds
human ACE2 with higher affinity than that of SARS-CoV, several published
SARS-CoV receptor-binding domain-specific monoclonal antibodies do
not have appreciable binding to that of SARS-CoV-2.[37] Yuan and co-workers described that a neutralizing antibody
previously isolated from a convalescent SARS patient, in complex with
the receptor-binding domain of the SARS-CoV-2 spike protein, targets
a highly conserved epitope, distal from the receptor-binding site,
that enables cross-reactive binding between SARS-CoV-2 and SARS-CoV.[43] The observed SSSC of the highly conserved epitope
(6w41_C) resembles that of SSSCPred100, and the gap of the epitope
SSSCs among the three systems is smaller than that of the receptor-binding
motif SSSCs (underlined SSSCs in Figure ). This suggests that although the binding
of the receptor-binding motif to human ACE2 stabilizes the bound conformation,
the flexibility of the receptor-binding motif in SARS-CoV-2 prevents
appreciable binding of the SARS-CoV receptor-binding motif-specific
monoclonal antibodies.
Figure 3
Sequence flexibility/rigidity map of SARS-CoV-2 RBD and
SARS-CoV
RBD. The identical SSSC sequences among the predicted ones by three
deep neural network-based systems and the corresponding observed ones
are colored (blue: α-helix-type conformation; red: β-sheet-type
conformation; green: a variety of other-type conformations). A comparison
of the SSSCPreds result (SARS-CoV-2) with that of Quick2D is also
shown.
The sequence flexibility/rigidity
map of SARS-CoV-2 RBD resembles
the sequence-to-phenotype maps, which were experimentally obtained
by deep mutational scanning (Figure ).[1] This suggests that the
identical SSSC sequences among the predicted ones by three deep neural
network-based systems (SSSCPred200, SSSCPred100, and SSSCPred) with
the observed ones correlate well with the sequences of both lower
ACE2-binding affinity and lower expression.[1] The sequence flexibility/rigidity map obtained from SSSCPreds could
therefore serve as a no-cost aid for mutation-site selection in proteins
in general.
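One simple way to picture a sequence flexibility/rigidity map is sketched below in Python. It assumes that positions where all predictions carry the same SSSC letter are treated as candidate rigid positions and that disagreeing positions are treated as candidate flexible ones; this reading, the "-" placeholder, and the example strings are assumptions for illustration and do not reproduce the SSSCPreds output format.

```python
def rigidity_map(*predictions: str) -> str:
    """Keep the SSSC letter where all predictions agree; write '-' elsewhere."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else "-"
        for column in zip(*predictions)
    )


# Hypothetical predictions from three systems for the same ten residues.
pred = "SSSHHHTTSS"
pred100 = "SSSHHHTSSS"
pred200 = "SSHHHHTSSS"

consensus = rigidity_map(pred, pred100, pred200)
print(consensus)
```

An observed SSSC string, when available, could be passed as an additional argument so that only positions identical among predictions and observation are kept.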
Spike Protein S2
The sequence identity
of spike protein
S2 between SARS-CoV-2 and SARS-CoV (aa 668 to 1255, about 90% identity)
was greater than that of S1 (see Figure S1). Only one identical motif (SSSC: SSSHSSHHHH) among all of the compared
SSSC sequences, including predicted and observed ones, was found at
the S2 subunit (Figure ). This motif is extremely rare: only 200 subunit files containing
the SSSC sequence of the motif exist among all of the 582,666 PDB
subunit files (see Figure S3). Usually,
the number of subunits for a commonplace motif (SSSC: SSSHHTSSS) is
about 140,000. Even for an already reported common motif (SSSC: SSSHHSHSSS)
in antibodies and in major histocompatibility complex class I and
II molecules, 34,039 subunits exist.[29] Apart
from virus proteins, integrin αL (leukocyte function-associated
antigen 1)[49,50] and cell division protein kinase
2 (CDK2),[51] which are involved in cell
adhesion and cell division, are the main proteins that have such a
relatively undeformable motif (Figure ). For CDK2 with cyclin A, an adenosine-5′-triphosphate
(ATP) molecule interacts with this motif.[52] The SSSC of this motif in the free form of CDK2 (1buh_A)[53] is identical to that in the ATP-binding form
(1fin_A).[52] The relatively undeformable
motif protrudes on the molecular surface, and the amino acid sequence
of the motif for SARS-CoV differs from the other proteins (6acc_A: LPPLLTDDMI; 3f74_A: YKTEFDFSDY; 3ig7_A: EFLHQDLKKF).
The motif (SSSC: SSSHSSHHHH) was also found in the subunits of reaction
center protein H, glyceraldehyde-3-phosphate dehydrogenase, glutamine
phosphoribosylpyrophosphate, green fluorescent protein fp512, and
insulin-degrading enzyme. Furthermore, another coronavirus, murine
hepatitis virus, has the identical motif (3jcl_A).[54] This motif may correlate with proteins recognizing phosphates
such as phospholipids. Walls and co-workers found that the SARS-CoV-2
S glycoprotein harbors a furin cleavage site (NSPRRAR↓S)
at the boundary between the S1/S2 subunits, which is processed during
biogenesis and sets this virus apart from SARS-CoV and SARS-related
CoVs.[40] Therefore, the relatively undeformable
motif at the S2 subunit may be a candidate drug discovery target.
The unique motif belongs to the connecting region of the S2 subunit,[33] and the entire structure of the postfusion state
S2 subunit of SARS-CoV-2 is not yet available.[31] The predicted conformations of heptad repeat 1 (HR1) and
HR2 at the S2 subunit correspond well to the observed ones in the
partial structure of the postfusion state (see Figure S1).[55] The SSSC sequence change
from “STSHHHSSTSSS” to “SSTTSSSSSSSS”
contributes to the transformation to the postfusion hairpin state (Figure ). The SSSCPreds
analysis also suggests that the sequence “AGFIKQYGDCLGDIAARDLI”
in the connecting region of the S2 subunit easily forms successive stable
α-helix-type conformations (Figure ) through the interaction of the fusion peptide (FP)
and the unique motif with the host membrane (Figure ). The prediction results near the connecting
region of the S2 subunit, including the relatively undeformable motif, would
be useful for understanding the conformational change from the
prefusion native state to the postfusion hairpin state via the prehairpin
intermediate state.
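The motif counts quoted above (for example, 200 subunit files containing SSSHSSHHHH) amount to a substring search over SSSC strings. A minimal Python sketch follows, assuming the SSSC strings are held in a dictionary keyed by subunit identifier; the identifiers and strings are hypothetical, and the exact matching rule is an assumption, since the fuzzy search used for structural code homology may differ.

```python
def subunits_with_motif(sssc_by_subunit: dict, motif: str) -> list:
    """Sorted identifiers of subunits whose SSSC string contains `motif`."""
    return sorted(
        subunit for subunit, sssc in sssc_by_subunit.items() if motif in sssc
    )


def motif_positions(sssc: str, motif: str) -> list:
    """0-based start indices of every (possibly overlapping) occurrence."""
    positions = []
    start = sssc.find(motif)
    while start != -1:
        positions.append(start)
        start = sssc.find(motif, start + 1)
    return positions


MOTIF = "SSSHSSHHHH"

# Hypothetical SSSC strings keyed by hypothetical subunit identifiers.
toy_dataset = {
    "sub_A": "TTSSSHSSHHHHTT",
    "sub_B": "HHHHHHHHTTSSSS",
    "sub_C": "SSSHSSHHHHSSSHSSHHHH",
}

hits = subunits_with_motif(toy_dataset, MOTIF)
starts = motif_positions(toy_dataset["sub_C"], MOTIF)
print(hits, starts)
```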
Figure 4
Sequence flexibility/rigidity map of the connecting regions
of
S2 subunits for SARS-CoV-2 and SARS-CoV. The identical SSSC sequences
among the predicted ones by three deep neural network-based systems
and the corresponding observed ones are colored (blue: α-helix-type
conformation; red: β-sheet-type conformation; green: a variety
of other-type conformations). A comparison of the SSSCPreds result
(SARS-CoV-2) with that of Quick2D is also shown. UH: upstream helix;
L: linker region; FP: fusion peptide; CR: connecting region; HR1:
heptad repeat 1.
Figure 5
Common undeformable motifs
of subunits in keyword-tagged datasets
(blue; SSSC: SSSHSSHHHH). (A) SARS-CoV (6acc, monomer), (B) SARS-CoV (6acc, trimer), (C) integrin
αL (3f74), (D) leukocyte function-associated antigen 1 (1zop), and (E) cell division
protein kinase 2 (3ig7). The relatively undeformable motif protrudes on the molecular surface.
Figure 6
Cryo-EM structures of SARS-CoV-2 S2. (A) Prefusion native
state
(6xr8_A) and (B) postfusion hairpin state (6xra_A). The S1 subunit
is omitted for clarity. CTD1: C-terminal domain 1; UH: upstream helix;
L: linker region; FP: fusion peptide; CR: connecting region; HR1:
heptad repeat 1; CH: central helix; BH: β-hairpin; SD3: subdomain
3; HR2: heptad repeat 2.
ORF8 Protein
The
sequence identity of ORF8 proteins
between SARS-CoV[45,46] and SARS-CoV-2[47,48] (about 24% identity) is very low, and the function is under debate.
No data related to ORF8 SSSC sequences are contained in the training
datasets. As shown in Figure , the concordance rate of SSSCs between SSSCPred100
and SSSCPred data for SARS-CoV-2 ORF8 (0.75) is larger than that for
SARS-CoV ORF8a/b (0.64). The predicted SSSCs of SARS-CoV-2 ORF8, with
many stable β-strands as in an immunoglobulin-like fold, correspond
closely to the observed ones (Figure ). It is suggested that three sets of intramolecular
disulfide bonds per monomer immobilize the thermodynamically unstable
conformations (Figure ), especially one of those in the motif of the amino acid sequence
“YTVSCLPFT” (SSSC: HSHSHTSSS).[48] Furthermore, the predicted SSSCs explain well that the flexibility
of the β3 motif of SARS-CoV ORF8a/b interferes with
crystallization (Figure ).
Figure 7
Sequence flexibility/rigidity
map of SARS-CoV-2 ORF8, SARS-CoV
ORF8a/b, and SARS-CoV GZ02 ORF8. The identical SSSC sequences between
SSSCPred100 and SSSCPred and the corresponding observed ones are colored
(blue: α-helix-type conformation; red: β-sheet-type conformation;
green: a variety of other-type conformations). A comparison of the
SSSCPreds result (SARS-CoV-2) with that of Quick2D is also shown.
Figure 8
Crystal structure of SARS-CoV-2 ORF8. The motif (green;
SSSC: HSHSHTSSS)
is immobilized by the disulfide bond (7jtl_A). The predicted SSSCs
suggest that the rare motif (red; SSSC: SSSHSHHTHSS) and the β3
motif of SARS-CoV ORF8a/b are flexible.
As shown in Figure , a very rare motif (SSSC: SSSHSHHTHSS) with the amino acid sequence
“RCSFYEDFLEY” in the observed SSSCs of SARS-CoV-2 was
found (7jtl_A). A total of 1372 subunit files containing the SSSC
sequence exist among all of the 582,666 PDB subunit files (see Figure S4), and the main subunits are endothiapepsin,
DNA damage-binding protein 1, isocitrate dehydrogenase [NADP] cytoplasmic,
DNA-directed RNA polymerase subunit α, 70 kDa peptidylprolyl
isomerase, purple acid phosphatase, α-1,6-mannanase, interferon
regulatory factor 4, β-fructofuranosidase, ethanolamine utilization
protein eutl, crispr-associated helicase, hemagglutinin-neuraminidase
glycoprotein, plasminogen, sigma A, and catalase-peroxidase 2 (see Figure S4). A 2-(N-morpholino)-ethanesulfonic
acid molecule interacts with this motif in a subunit of DNA damage-binding
protein 1 (4a08_A). Although the predicted SSSCs suggest that the
motif is flexible, the motif may correlate with the regulation of
phosphorylated substances such as DNA, RNA, and phosphorylated proteins
and may contribute to the difference that ORF8 of SARS-CoV-2, but
not ORF8 or ORF8a/b of SARS-CoV, downregulates MHC-I in cells.[48]
Conclusions
The
deep neural network-based program for sequence-based prediction
of SSSCs (SSSCPrediction) and the comparison program (SSSCPreds) of
three deep neural network-based prediction systems (SSSCPred200, SSSCPred100,
and SSSCPred) to predict the flexibility and conformational change
of proteins were constructed. The degree of flexibility of the receptor-binding
motif of the SARS-CoV-2 spike protein and the rigidity of the unique motif
at the S2 subunit cannot be determined from the Cryo-EM and X-ray
structures alone. This methodology provides a verified path to analyze other
proteins in a similar way when X-ray or
Cryo-EM structures are not available, as for SARS-CoV ORF8a/b. The sequence
flexibility/rigidity map obtained from the combined analysis of predicted
and observed SSSCs with keyword-tagged datasets would be useful for
understanding viral evolution.
Computational Methods
Dataset
A total
of 582,813 FASTA-format files containing
the amino acid sequences and SSSCs of protein subunits were extracted
from 139,932 PDB files[44] by using the SSSCview
program (available online at https://staff.aist.go.jp/izumi.h/SSSCPreds/index-e.html).[28] Of these FASTA files, 379,334 files
containing subunits with at least 100 continuous amino
acid residues were extracted, and from those files, 150,000 files
as training data for the deep neural network, 10,000 files as test
data for the deep neural network, and three sets of 10,000 files as
test data for the inference system were randomly selected.
From
each FASTA file, a set of 100 continuous amino acid residues and the
corresponding SSSC were randomly extracted. SSSC terms H, S, T, and
D were converted to [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1],
respectively, and a set of matrices (100, 4) was constructed. The
amino acid sequence was also similarly converted.[56] The dataset for the deep neural network was prepared by
using Python.[57]
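The encoding step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `one_hot` and `sample_window` are hypothetical, and the 20-letter amino acid alphabet and its ordering are assumptions, since the text only states that the sequence was "similarly converted".

```python
import random

SSSC_TERMS = "HSTD"                    # H, S, T, D -> [1,0,0,0] ... [0,0,0,1]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumption: 20 standard residues, arbitrary order


def one_hot(seq, alphabet):
    """Return a len(seq) x len(alphabet) matrix as a list of rows.

    Raises KeyError for symbols outside the alphabet.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in seq:
        row = [0] * len(alphabet)
        row[index[symbol]] = 1
        rows.append(row)
    return rows


def sample_window(residues, sssc, length=100, rng=random):
    """Pick one random run of `length` residues and the matching SSSC run."""
    if len(residues) != len(sssc):
        raise ValueError("sequence and SSSC must have the same length")
    if len(residues) < length:
        raise ValueError("subunit is shorter than the requested window")
    start = rng.randrange(len(residues) - length + 1)
    return residues[start:start + length], sssc[start:start + length]
```

Calling `one_hot(code, SSSC_TERMS)` on a sampled 100-term SSSC gives the (100, 4) matrix mentioned in the text; with the assumed alphabet, the residue matrix would be (100, 20).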
SSSCPrediction
Deep learning for the prediction of
SSSCs from amino acid sequences was performed by using Neural Network
Console 1.40 (https://dl.sony.com/app/). Neural Network Console is convenient artificial intelligence (AI)
software and can automatically optimize trained networks using a Gaussian
process. The revised template of network “12_residual_learning.sdcproj”
for the standard MNIST (Modified National Institute of Standards and
Technology) dataset was used to provide the initial structure of the
deep neural network, which was then trained with our prepared training
dataset. The obtained trained network is shown in Figure 9A (activation function: ReLU;
cost function: HuberLoss; max epoch: 20; batch size: 64; precision:
float; structure search: Network Feature + Gaussian Process; updater:
Adam; update interval: 1 iteration; alpha: 0.001; beta1: 0.9; beta2:
0.999; epsilon: 1E-8). The obtained network and parameters were introduced
to the SSSCPrediction inference system, and the system was set to
examine amino acid sequences containing at least 100 amino acid residues.
For each amino acid sequence, SSSC terms were predicted for 50 continuous
amino acid residues and for the initial and final 100 amino acid residues
in the sequence. Then, the first 70 SSSC terms in the sequence were
selected, followed by 50 SSSC terms; any remaining SSSC terms at the
end of the sequence were also selected. The other prepared three sets
of 10,000 test data files for the SSSCPrediction inference system
were then used to evaluate the concordance rate using agreements of
H, S, T, and D symbols.
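One possible reading of the windowing scheme above is sketched below; it is an interpretation, not the published implementation. It assumes that fixed-length windows are moved along the sequence in steps of 50 residues, that the first 70 terms come from the first window, that each later window contributes the next 50 terms, and that the final window supplies whatever remains. The function name `stitch_predictions` and the callable `predict_window` (standing in for the trained network) are hypothetical.

```python
def stitch_predictions(seq, predict_window, window=100, head=70, stride=50):
    """Assemble a full-length SSSC string from fixed-length window predictions.

    predict_window: hypothetical callable that maps a string of `window`
    residues to an SSSC string of the same length.
    """
    n = len(seq)
    if n < window:
        raise ValueError("sequence is shorter than the prediction window")
    offset = head - stride  # local start of the slice taken from later windows

    # First window: keep the first `head` terms.
    out = predict_window(seq[:window])[:head]

    # Later windows, shifted by `stride`: keep the next `stride` terms each.
    start = stride
    while start + window <= n:
        out += predict_window(seq[start:start + window])[offset:offset + stride]
        start += stride

    # Final window (last `window` residues): keep every term not yet covered.
    tail_start = n - window
    out += predict_window(seq[tail_start:])[len(out) - tail_start:]
    return out
```

With `window=200, head=125, stride=50` the same sketch matches the numbers quoted for SSSCPred200. In both settings the slice taken from each interior window lies near its center, which is one plausible reason for choosing 70 and 125 as the initial lengths.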
Figure 9
Network architectures of SSSCPreds. (A) SSSCPred,
(B) SSSCPred100,
and (C) SSSCPred200.
Comparison of SSSCPrediction
with Quick2D[8] was carried out by using
an amino acid sequence of a PDB file (1a00_A:
HEMOGLOBIN ALPHA CHAIN). The method was benchmarked by using 612 and
17,169 protein subunits containing at least 100 amino acid residues
in the CB513 and CullPDB datasets.[15] CB513
is a nonredundant dataset built for developing and training
algorithms, including neural networks, that predict protein
secondary structure. The CullPDB dataset is a large nonhomologous sequence
set produced by using the PISCES server, which culls subsets of protein
sequences from the Protein Data Bank based on sequence identity and
structural quality criteria. The 150,000 training data files and the
10,000 test data files for the prediction of SSSCs from amino acid
sequences using the deep neural network were also tested to evaluate
the concordance rate.
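The text does not give a formula for the concordance rate. A minimal sketch, assuming it is the fraction of positions at which the predicted and observed H, S, T, and D symbols agree, could look like this (the function name `concordance_rate` is hypothetical):

```python
def concordance_rate(predicted, observed):
    """Fraction of positions where predicted and observed SSSC symbols match."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed SSSCs must have the same length")
    if not predicted:
        raise ValueError("empty SSSC strings")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)
```

Averaging this value over all subunits in a benchmark set would give a single per-dataset figure; whether the original evaluation averaged per subunit or pooled all residues is not stated.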
Comparison Program (SSSCPreds) of Three Deep
Neural Network-Based
Prediction Systems
Two additional deep neural network-based
prediction systems were constructed by using procedures similar to
that used to construct SSSCPrediction (SSSCPred). As before, a total
of 582,666 FASTA-format files containing the amino acid sequences
and SSSCs of protein subunits were extracted from 139,932 PDB files[44] by using the SSSCview program.[28] Of these FASTA files, 207,738 files containing subunits
with at least 200 continuous amino acid residues were
extracted, and from those files, 150,000 files as training data for
the deep neural network, 10,000 files as test data for the deep neural
network, and 10,000 files as test data for the inference system were
randomly selected for SSSCPred200. From each FASTA file, a set of
200 continuous amino acid residues and the corresponding SSSC were
randomly extracted. SSSC terms H, S, T, and D were converted as before
to [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively, and
a set of matrices (200, 4) was constructed. The amino acid sequence
was also similarly converted. Deep learning for the prediction of
SSSCs from amino acid sequences was performed by using Neural Network
Console 1.40 (https://dl.sony.com/app/). The revised template of network 12_residual_learning.sdcproj for
the standard MNIST dataset was used to provide the initial structure
of the deep neural network, which was then trained with the prepared
training dataset. The obtained network (Figure 9C) and parameters were introduced to the SSSCPred200
inference system, and the system was set to examine amino acid sequences
containing at least 200 amino acid residues. For each amino acid sequence,
SSSC terms were predicted for 50 continuous amino acid residues and
for the initial and final 200 amino acid residues in the sequence.
Then, the first 125 SSSC terms in the sequence were selected, followed
by 50 SSSC terms; any remaining SSSC terms at the end of the sequence
were also selected. The three prediction programs for SSSCPreds
(SSSCPred200, SSSCPred100, and SSSCPred) were obtained as follows.
Training data of 200 continuous amino acid residues and 150,000 subunits
were used to construct SSSCPred200, those of 100 continuous amino
acid residues and 350,000 subunits were used for SSSCPred100, and
those of 100 continuous amino acid residues and 150,000 subunits were
used for SSSCPred. The sequence sampling range of SSSCPred200, SSSCPred100,
and SSSCPred is considered to be optimal for the prediction of protein
stability and rigidity from the sequences described above from the
PDB. Consideration of longer sequences, such as SSSCPred300, would be limited
to far fewer available sequences of such length. Consideration
of shorter sequences, such as SSSCPred50 or SSSCPred20, would involve
peptide sequences that are heavily influenced by solution environment
factors; accounting for these factors would carry the correlations into
a domain beyond the three programs (SSSCPred200, SSSCPred100, and SSSCPred)
described here for stable protein sequences.
SSSCPreds is available
as a standalone program at https://staff.aist.go.jp/izumi.h/SSSCPreds/index-e.html.
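The comparison of the three systems can be illustrated with a short sketch. It is not the SSSCPreds program itself. Following the abstract, it assumes that positions where all three predictions are identical are the informative signal (suggesting rigidity) and that disagreement suggests flexibility. The function name `agreement_map` and the gap symbol are hypothetical.

```python
def agreement_map(pred200, pred100, pred, gap="-"):
    """Keep a symbol where all three predictions agree; mark other positions with `gap`."""
    if not (len(pred200) == len(pred100) == len(pred)):
        raise ValueError("all three predictions must have the same length")
    return "".join(
        a if a == b == c else gap
        for a, b, c in zip(pred200, pred100, pred)
    )
```

For example, if two systems return SSSHSSHHHH and the third returns SSSHTSHHHH, the map is SSSH-SHHHH, flagging the fifth position as the only one where the systems disagree.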