Genome-wide analysis of Indian SARS-CoV-2 genomes to identify T-cell and B-cell epitopes from conserved regions based on immunogenicity and antigenicity.

Nimisha Ghosh¹, Nikhil Sharma², Indrajit Saha³, Sudipto Saha⁴.

Abstract

SARS-CoV-2 has a high transmission rate and mutates frequently, which makes vaccine development an arduous task. Researchers around the globe are nevertheless working hard to find a solution, e.g. a synthetic vaccine. Here, we have performed a genome-wide analysis of 566 Indian SARS-CoV-2 genomes to extract potential conserved regions for identifying peptide-based synthetic vaccines, viz. epitopes with high immunogenicity and antigenicity. In this regard, different multiple sequence alignment techniques are used to align the SARS-CoV-2 genomes separately. Subsequently, consensus conserved regions are identified after finding the conserved regions from the aligned result of each alignment technique. Further, the consensus conserved regions are refined by requiring that their lengths are greater than or equal to 60 nt and that their corresponding proteins are devoid of any stop codons. Their specificity, expressed as query coverage, is then verified using Nucleotide BLAST. Finally, from these consensus conserved regions, T-cell and B-cell epitopes are identified based on their immunogenic and antigenic scores, which are then used to rank the conserved regions. As a result, we have ranked 23 consensus conserved regions that are associated with different proteins. This ranking also resulted in 34 MHC-I and 37 MHC-II restricted T-cell epitopes with 16 and 19 unique HLA alleles respectively, and 29 B-cell epitopes. After ranking, the consensus conserved region from the NSP3 gene is found to be highly immunogenic and antigenic. In order to judge the relevance of the identified epitopes, the physico-chemical properties and binding conformation of the MHC-I and MHC-II restricted T-cell epitopes are shown with respect to HLA alleles.

Keywords: B-cell epitopes; Conserved regions; Peptide based vaccine; Physico-chemical properties; SARS-CoV-2; T-cell epitopes

Year: 2020 PMID: 33385714 PMCID: PMC7831793 DOI: 10.1016/j.intimp.2020.107276

Source DB: PubMed Journal: Int Immunopharmacol ISSN: 1567-5769 Impact factor: 5.714

Introduction

In December 2019, China reported a sudden outbreak of pneumonia of unknown cause in Wuhan city, Hubei province [1], which was later attributed to a virus named SARS-CoV-2. SARS-CoV-2 belongs to the family Coronaviridae, which also includes SARS-CoV-1 [2], [3] and MERS-CoV [4]. The genomic sequence of the newly reported virus was found to be highly similar to that of SARS-CoV (95%–100%), thus showing the evolutionary similarity between SARS-CoV and SARS-CoV-2 [5]. By October 2020, India had registered over 7.65 million cases [6], making it one of the most affected countries in the world. Symptoms of COVID-19 vary from fever, cough, myalgia, dyspnoea and diarrhoea to severe respiratory distress which may require life support systems. In severe cases, it may even lead to death [7]. Considering these consequences, the World Health Organisation (WHO) suggested interrupting human–human contact through total lockdowns, along with precautionary measures such as face masks and hand sanitizers, to control the spread of COVID-19. Hence, it is the need of the hour to find a cure for COVID-19 in the form of a vaccine. Classical methods of vaccine design, like attenuation of the virus through external sources such as micro-organisms to mitigate its harm or virulence, usually depend on the response of the virus itself. Sometimes mutations in the virus genome can result in an autoimmune response, eventually making the virus even more virulent. Hence, such classical vaccine design approaches are time consuming, expensive and may not provide an effective response. With the evolution of bioinformatics and genome analysis, it is now possible to study the DNA, RNA and molecular evolution of a virus, which can aid the development of vaccines through approaches such as reverse vaccinology. Reverse vaccinology involves pinpointing the protein sites that result in synthetic peptide-based vaccines [8], [9].
The preparation of an epitope-based vaccine is carried out sequentially, starting from scanning the genome of the pathogen to locate the surface proteins, followed by extracting the best epitopes situated on the surface and testing these synthetic designs against any autoimmune response [9]. The antigens provided by the epitopes are the sites to which antibodies bind; hence, selection of the best epitopes is one of the crucial and foremost steps in vaccine design. In this regard, Skwarczynski et al. [8] have suggested several factors which influence the selection of epitopes, such as immune response to the pathogen, hypersensitivity responses and coverage of different peptides against different pathogen subtypes. Further, these epitopes can be classified into two classes, i.e. MHC-I and MHC-II associated T-cell epitopes [10] and B-cell epitopes [11], based on their responses against recognized foreign pathogens. The antigens provided by MHC-I interact directly with the CD8 cells, evoking the cellular response [8]. MHC-II antigens bind to the surface of the pathogens to initiate the T-helper cells (CD4) which are responsible for activating the Th1 and Th2 type helper cells in the form of cytotoxic T-lymphocyte (CTL) and humoral response through antigens loaded in MHC-I and B-cell epitopes. Hence, the selection of T-cell and B-cell epitopes is a crucial process in order to provide a reliable vaccine. Considering the several advantages of peptide-based vaccines, many studies have been carried out to design a vaccine that provides a stable solution against the threat presented by the SARS-CoV-2 virus. Earlier, it was found that the spike (S) glycoprotein of SARS-CoV-2 can act as an intermediary to bind to the host cells with a very strong affinity, thus attracting various experiments towards this protein site as the potential target for vaccine design and diagnostics [12].
Following this, many types of vaccine designs have been proposed based on RNA, vectored, recombinant protein sequence and cell-cultures while focusing on the spike protein or whole virion [13]. Additionally, in Lin et al. [14], heptad repeats 1 and 2 (HR1 and HR2) in the spike protein have been predicted, followed by peptides identified with the help of molecular dynamics simulation of the fusion between the viral membrane and the host cell membrane, eventually limiting the spread of the virus within the host cells. Another study carried out by Vashi et al. [15] predicted 24 potential epitope fragments, of which 20 were on the surface of the spike protein. This information can be helpful for designing potential immunogenic peptide-based vaccines. A similar study has been conducted by Rakib et al. [16] in which the spike protein region has been analysed through multiple sequence analysis in different SARS-CoV-2 genomes to predict the most immunogenic peptide fragments. In another study [17], a multi-epitope based vaccine has been proposed through analysing the S1 and S2 domains of the spike proteins of the SARS-CoV-2 genomes in order to provide the best epitopes for designing a vaccine. However, it is important to note that other protein sites can also be targeted for vaccine design [18]. This depends on how the T-cell interacts inside the different protein regions of SARS-CoV-2. Grifoni et al. [18] have identified that 70–100% of epitope pools detect CD8 and CD4 T-cells for SARS-CoV-2. CD4+ cells interact with other proteins like membrane (M), nucleocapsid (N) and ORF1ab proteins like NSP3, NSP4 and NSP12, but the dominance of CD4+ cells is very high within the spike region. On the other hand, no such dominant reactivity was identified in the case of CD8+ cells in the spike protein region. Hence, MHC-I restricted epitopes derived from M, NSP6, ORF3a or N proteins can also be considered for vaccine design. Noorimotlagh et al. [19] have conducted a review of several papers and have inferred a set of T-cell and B-cell epitopes from the Spike and Nucleocapsid proteins with high antigenicity.
Genomic analysis conducted by Yadav et al. [20] on the first two cases reported in India indicated the introduction of two non-identical strains of SARS-CoV-2. With time, more mutation points have been discovered [21] as well. Such alterations in the protein regions of the genome can lead to vaccine failures, as was noticed in the case of the Influenza virus in 2013–14 [22]. Hence, stable vaccine design is the need of the hour. Moreover, for such RNA viruses which undergo rapid mutations, Nandy et al. [9] have suggested the extraction of genomic regions which are either not influenced or only slightly influenced by the process of mutation. This can be carried out by analysing a large set of virus genomes with the help of sequence alignment techniques. Such similar regions inside different viral genomes can then be considered for synthetic peptide vaccine designs. In [23], Gupta et al. have developed a web resource “CoronaVR” and have identified a set of T-cell and B-cell epitopes that can be incorporated in vaccine design. On the other hand, Crooke et al. [24] have used available algorithms and webtools to identify 41 T-cell epitopes (5 HLA class I, 36 HLA class II) and 6 B-cell epitopes as probable targets for epitope-based vaccine design. Ong et al. [25] have used Vaxign and the recently developed Vaxign-ML reverse vaccinology tools to predict potential vaccine candidates for COVID-19. Apart from Spike, they have identified epitopes derived from NSP3, 3CL-pro, NSP8, NSP9 and NSP10 proteins to be highly likely candidates for vaccine design. There are other works, such as [26], [27], [28], [29], [30], [31], [32], [33], pertaining to epitope identification in SARS-CoV-2 for vaccine design.
In the literature discussed above, prediction of epitopes has been performed by analysing the virus proteins, whereas genetic mutations are the primary reason for change in the structure of the virus proteins. This fact motivated us to analyse the 566 available Indian SARS-CoV-2 genomes to identify the conserved regions and predict the immunogenic and antigenic epitopes. For this purpose, we have used four different multiple sequence alignment techniques, viz. ClustalW [34], MUSCLE [35], ClustalO [36], [37] and MAFFT [38], to align the sequences. Consensus conserved regions (CCnR) are then identified after finding the conserved regions from the aligned result of each alignment technique. Further, these conserved regions are filtered on the basis that (a) the length should be greater than or equal to 60 nt and (b) the corresponding protein sequence should not have any stop codons. This is followed by validation of the specificity of the conserved regions, expressed as query coverage, with the help of Nucleotide BLAST [39]. These filtered conserved regions are then used to identify the T-cell and B-cell epitopes based on their immunogenic and antigenic scores. Thereafter, these scores are used to rank the conserved regions. As a result, we have obtained 23 conserved regions encompassing NSP1, NSP2, NSP3, NSP4, 3CL-Proteinase, NSP10, RNA-directed RNA polymerase, Helicase, Spike glycoprotein and Nucleocapsid protein. Subsequently, the consensus conserved region in the NSP3 gene has been found to be highly immunogenic and antigenic. It provides the MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes FLKKDAPYI, ITFLKKDAPYIVGDV and TLVSDIDITFLKKDAP respectively as immunogenic, and TAVVIPTKK, IDITFLKKDAPYIVG and LHPDSATLVSDIDITF respectively as antigenic. Different immunogenic and antigenic epitopes associated with the other conserved regions are provided as well. Finally, to validate the identified epitopes, the conformational 2D non-covalent structure of the chosen epitopes is studied. Moreover, the physico-chemical properties of the epitopes, along with the Ramachandran plot and Z-scores, are also reported in the paper.

Materials and methods

In this section, at first the data preparation is elaborated followed by the discussion on the pipeline of the proposed work. For the benefit of the readers, brief discussions on epitope based vaccine, T-cell and B-cell epitopes and their prediction tools, physico-chemical properties of epitopes and docking of T-cell epitopes are given in the supplementary file. Moreover, prediction tools for T-cell and B-cell epitopes are reported in Supplementary Tables S1 and S2.

Data preparation

In order to map the SARS-CoV-2 proteins, we have used the reference SARS-CoV-2 genome (NC_045512.2) and 44583 available protein sequences from the National Center for Biotechnology Information (NCBI). To generate the protein sequence, we have taken the reference sequence of the SARS-CoV-2 genome and considered the concept of reading frames. A reading frame divides the sequence of nucleotides of the reference sequence into a set of successive, non-overlapping triplets. There are three possible reading frames: Frame 1, which starts from the first nucleotide of the reference sequence; Frame 2, which starts from the second nucleotide; and Frame 3, which starts from the third nucleotide. For each frame, the triplets are then translated into the corresponding proteins based on the codon table. Finally, we have obtained 25 such unique proteins, which were best matched to Frame 2. Also, the recent genomic sequences of Indian SARS-CoV-2 virus have been collected from the Global Initiative on Sharing All Influenza Data (GISAID) in FASTA format. The collection contains 566 complete and near-complete genomes with sequence IDs. The average length of the 566 genomes is 29,831 bp. These 566 SARS-CoV-2 sequences are aligned using multiple sequence alignment (MSA) techniques to extract the conserved regions. Also, the coded protein associated with each conserved region is extracted. For the alignment of sequences, the High Performance Computing (HPC) facility of NITTTR, Kolkata is used. The HPC cluster has a master node with dual Intel Xeon Gold 6130 processors having 32 cores, 2.10 GHz, 22 MB L3 cache and 128 GB DDR4 RAM, and 2 GPU and 4 CPU computing nodes with dual Intel Xeon Gold 6152 processors having 44 cores, 2.1 GHz, 30 MB L3 cache and 192 GB DDR4 RAM each, while the GPU nodes have an NVIDIA Tesla V100 GPU with 16 GB memory each. MSA was performed using the 2 GPU and 4 CPU computing nodes.
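The three-frame translation described above can be sketched as follows. This is a minimal illustration, assuming the standard genetic code (NCBI translation table 1) and forward-strand frames only; the function name `translate_frame` and the use of `X` for unrecognised codons are hypothetical choices and are not taken from the original pipeline.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(seq: str, frame: int) -> str:
    """Translate seq in reading frame 1, 2 or 3 ('*' marks a stop codon)."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    seq = seq.upper().replace("U", "T")
    start = frame - 1
    # Successive, non-overlapping triplets; an incomplete tail is dropped.
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# One of the conserved regions reported in Table 1 (genome positions 5307-5367).
fragment = "TAACACTCCAACAAATAGAGTTGAAGTTTAATCCACCTGCTCTACAAGATGCTTATTACAG"
for frame in (1, 2, 3):
    print(frame, translate_frame(fragment, frame))
```

The frame index is relative to the start of the sequence passed in. The fragment above starts at genome position 5307, so its local frame 3 begins at position 5309, which lies on the genome-wide Frame 2 (positions 2, 5, 8, ...).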

Pipeline of the workflow

The pipeline of the workflow is shown in Fig. 1. To start with, we have focused on finding the conserved regions in the 566 Indian SARS-CoV-2 genome sequences which are not affected by genetic mutations. To this end, we have constructed a Consensus Multiple Sequence Alignment (CMSA) approach in which four different alignment techniques, ClustalW, MUSCLE, ClustalO and MAFFT, are used to align the 566 SARS-CoV-2 sequences. Subsequently, consensus conserved regions (CCnR) are identified after finding the conserved regions from the aligned result of each alignment technique.

ClustalW initially performs pairwise alignment of all sequences by using the k-tuple method. Thereafter, the MSA is created by progressively aligning the most closely related sequences based on a guide tree built with the Neighbor-Joining method.

In the MUSCLE technique, two distance measures are used: k-mer for unaligned pairs and the Kimura method for aligned pairs of sequences. Initially, a draft MSA is produced in MUSCLE using the k-mer method. Then, a progressive alignment is constructed based on the guide tree produced by the UPGMA method. This initial tree is then re-estimated using the Kimura distance method, after which the UPGMA method is once again used to produce a new guide tree, thereby creating a second MSA. New MSAs are finally created by realigning the two sequences created previously.

ClustalO uses the k-tuple method to produce pairwise alignment. Then mBed is used to cluster the sequences, followed by the k-means clustering algorithm. Next, the guide tree is built using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). Finally, the MSA is constructed using the HHalign package.

MAFFT uses two different heuristic methods, progressive (FFT-NS-2) and iterative refinement (FFT-NS-i). The main aim of MAFFT is to merge local and global algorithms for MSA. Initially, FFT-NS-2 is used to calculate all-pairwise distances to create a provisional MSA from which refined distances are calculated. Then, FFT-NS-i is performed to get the final MSA.

Thereafter, to identify the conserved regions, these aligned sequences are used to compute the entropy (E) of each alignment position from $S_{x}^{y}$, the frequency of each residue x occurring at position y, over 5 possible symbols (the four nucleotides plus the gap). To identify the conserved regions (CnRs) for each alignment technique, a minimum segment length of 15 is considered with a maximum average entropy of 0.2. Further, the maximum entropy per position is taken as 0.2, with no gaps after finding the consensus sequence for the 566 genomic sequences. All these values are taken from the literature. Thereafter, the CCnRs are identified considering the CnRs of all the alignment techniques. Next, a refinement process is carried out for the CCnRs based on the criteria that their length is greater than or equal to 60 nt and no stop codon is present in the associated protein sequence. Moreover, Nucleotide BLAST is used to verify the specificity of the CCnRs, expressed as query coverage.

Subsequently, T-cell and B-cell epitopes are identified from these CCnRs. To predict the T-cell and B-cell epitopes and to find their corresponding immunogenic scores, each CCnR is subjected to IEDB and ABCPred respectively. As recommended by IEDB, NetMHCpan and the Consensus Approach [40] are selected for the prediction of MHC-I and MHC-II T-cell epitopes respectively, whereas the prediction of B-cell epitopes is carried out by ABCPred, which uses a Recurrent Neural Network. Then, for the predicted epitopes, antigenic scores are calculated with the help of VaxiJen 2.0. For each CCnR, multiple T-cell and B-cell epitopes are identified along with their corresponding immunogenic and antigenic scores. Subsequently, for each CCnR the highest immunogenic and antigenic scores are considered to select the corresponding epitopes. Furthermore, these scores are used to rank the CCnRs based on the geometric mean as given in Eq. (2):

$$R_{CCnR} = \left( \prod_{i=1}^{3} IS_i \times AS_i \right)^{1/6} \qquad (2)$$

where $R_{CCnR}$ represents the rank of a consensus conserved region (CCnR) based on the geometric mean of the immunogenic and antigenic scores of T-cell and B-cell epitopes, and $IS_i$ and $AS_i$ are the scaled immunogenic and antigenic scores for MHC-I, MHC-II and B-cell epitopes respectively. The geometric mean is used to avoid the skewness of the immunogenic and antigenic scores obtained for T-cell and B-cell epitopes, so that a proper ranking of the consensus conserved regions can be performed.

Moreover, to validate the identified epitopes, the conformational 2D non-covalent structures of the identified epitopes are studied using LigPlot+ [41]. Furthermore, the BepiPred 2.0 server [42] is used for the verification of the predicted B-cell epitopes. Also, the physico-chemical properties of the epitopes, along with the Ramachandran plot, are reported through PyMOL [43] and its extensions AutoDock Vina (for docking) [44] and PyMOD 3 [45], while the ProSA [46] online server is used for the Z-score calculation.
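The entropy-based detection of conserved regions can be sketched as below. This is only an illustration under stated assumptions: the entropy is taken to be the Shannon entropy with natural logarithm over the observed symbols (the exact formula is not reproduced here), a position is rejected when its consensus symbol is a gap, and the names `column_entropy` and `conserved_segments` are hypothetical. The thresholds (minimum length 15, maximum per-position and average entropy 0.2) are the ones stated in the text.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (natural log) of one alignment column (assumed form)."""
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in Counter(column).values())


def conserved_segments(alignment, min_len=15, max_avg=0.2, max_pos=0.2):
    """Return conserved segments as 0-based, half-open (start, end) columns.

    `alignment` is a list of equal-length aligned sequences over A, C, G, T, '-'.
    """
    ncols = len(alignment[0])
    columns = ["".join(row[j] for row in alignment) for j in range(ncols)]
    entropy = [column_entropy(col) for col in columns]
    consensus = [Counter(col).most_common(1)[0][0] for col in columns]

    segments, start = [], None
    for j in range(ncols + 1):  # one extra step closes a trailing segment
        ok = j < ncols and entropy[j] <= max_pos and consensus[j] != "-"
        if ok and start is None:
            start = j
        elif not ok and start is not None:
            length = j - start
            if length >= min_len and sum(entropy[start:j]) / length <= max_avg:
                segments.append((start, j))
            start = None
    return segments


# Toy alignment: 16 identical columns, one variable column, 3 identical columns.
toy = [
    "ACGTACGTACGTACGTAAAA",
    "ACGTACGTACGTACGTCAAA",
    "ACGTACGTACGTACGTGAAA",
]
print(conserved_segments(toy))
```

In the toy alignment only the first 16 columns qualify, because the trailing run of three identical columns is shorter than the minimum segment length.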

Fig. 1

Pipeline of the Workflow.

Results and discussions

Ranking of the CCnRs

Experiments in this study are carried out according to the flowchart shown in Fig. 1. Initially, the 566 Indian SARS-CoV-2 genomes are aligned using the Consensus Multiple Sequence Alignment (CMSA) techniques ClustalW, MUSCLE, ClustalO and MAFFT. Subsequently, we have obtained 125 CCnRs by considering all the alignment techniques. This is shown in Fig. 2, where the 438, 439, 438 and 438 conserved regions (CnRs) from ClustalW, MUSCLE, ClustalO and MAFFT respectively result in 125 CCnRs. This is followed by mapping of the CCnRs to 11 coding regions, i.e. ORF1ab, Spike, ORF3a, Envelope, Membrane, ORF6, ORF7a, ORF7b, ORF8, Nucleocapsid and ORF10. The corresponding protein sequence for each CCnR has been taken from Frame 2. Now, the 125 CCnRs are filtered based on the criteria that (a) their length should be greater than or equal to 60 nt and (b) no stop codons should be present in the corresponding proteins. A BLAST specificity score, expressed as query coverage, equal to 100% is also required during the filtering process. As a result, 23 CCnRs have been identified. Subsequently, these CCnRs are ranked on the basis of the geometric mean of the highest immunogenic and antigenic scores of the corresponding MHC-I and MHC-II restricted T-cell and B-cell epitopes. It is worth mentioning that the immunogenic and antigenic scores are scaled to the range 0–1 to bring the scores of all the epitopes for different CCnRs to a uniform scale. The scaled scores are used throughout the paper, while the actual scores are given in the Supplementary excel file. After ranking, the top 5 CCnRs, along with their corresponding protein sequences, lengths, BLAST specificity scores, percentage of BLAST specificity scores as query coverage, coding regions with their starting and ending coordinates, lengths and coded proteins, are reported in Table 1. Moreover, the ranking with the scores of these top 5 CCnRs is reported in Table 2. It is found from Table 1 that the top 5 CCnRs belong to the ORF1ab coding region and code for the NSP3 (two CCnRs), 3CL-Proteinase, NSP10 and NSP4 proteins. Please note that all the 23 CCnRs are reported in Supplementary Table S3 while their ranking details are given in Supplementary Table S4.
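The consensus and filtering steps can be illustrated with a short sketch. The text does not spell out how the consensus over the four alignment techniques is formed, so the code below assumes a plain interval intersection of the per-technique conserved regions after they have been mapped to shared reference coordinates; this is an assumption, and the helper names `intersect_two`, `consensus_regions` and `passes_filter` are hypothetical. The filter applies the two stated criteria (length of at least 60 nt, no stop codon in the translated protein); the BLAST query-coverage check is not reproduced.

```python
def intersect_two(a, b):
    """Intersect two sorted lists of closed intervals (start, end)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def consensus_regions(per_tool):
    """Regions shared by every tool's conserved-region list (assumed rule)."""
    result = sorted(per_tool[0])
    for regions in per_tool[1:]:
        result = intersect_two(result, sorted(regions))
    return result


def passes_filter(start, end, protein, min_nt=60):
    """Length >= min_nt (closed coordinates) and no stop codon ('*')."""
    return (end - start + 1) >= min_nt and "*" not in protein


# Made-up conserved regions from four hypothetical alignment runs.
per_tool = [
    [(10, 100), (200, 260)],
    [(20, 110), (205, 300)],
    [(15, 90), (190, 250)],
    [(0, 95), (210, 270)],
]
print(consensus_regions(per_tool))
```

Applied to a region reported in Table 1, `passes_filter(5307, 5367, "TLQQIELKFNPPALQDAYY")` accepts the region, since it spans 61 nt and its protein contains no stop codon.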

Fig. 2

125 Consensus Conserved Regions (CCnRs) from the four alignment techniques.

Table 1

Top 5 Consensus Conserved Regions (CCnRs) as derived from SARS-CoV-2 with associated details.

Consensus Conserved Region (CCnR)	Protein Sequence of CCnR	Length of CCnR	BLAST Specificity Score of CCnR	% of BLAST Specificity Score as Query Coverage	Coding Region (CR)	Starting Coordinate of CR	Ending Coordinate of CR	Length of CR	Coded Protein from CR
4012-CACAGAAAACTTGTTACTTTATATTGACATTAATGGCAATCTTCATCCAGATTCTGCCACTCTTGTTAGTGACATTGACATCACTTTCTTAAAGAAAGATGCTCCATATATAGTGGGTGATGTTGTTCAAGAGGGTGTTTTAACTGCTGTGGTTATACCTACTAAAAAGGCTGGTGGCACTACTGAAATGCTAGCGAAAGCTTT-4215	TENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKA	204	377	100	ORF1ab	266	21555	21290	NSP3
10463-TTAAGGGTTCATTCCTTAATGGTTCATGTGGTAGTGTTGGTTTTAACATAGATTATGACTGTGTCTCTTTTTGTTAC-10539	KGSFLNGSCGSVGFNIDYDCVSFCY	77	143	100	ORF1ab	266	21555	21290	3CL-Proteinase
13291-TTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACACTTAAAAACACAGTCTGTACCGTCTGCGGTAT-13391	FCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCG	101	187	100	ORF1ab	266	21555	21290	NSP10
5307-TAACACTCCAACAAATAGAGTTGAAGTTTAATCCACCTGCTCTACAAGATGCTTATTACAG-5367	TLQQIELKFNPPALQDAYY	61	113	100	ORF1ab	266	21555	21290	NSP3
9564-ATTCTTACCTGGTGTTTATTCTGTTATTTACTTGTACTTGACATTTTATCTTACTAATGATGTTTCTTTTTTAGCACATATTCAGTGGATGGTT-9657	FLPGVYSVIYLYLTFYLTNDVSFLAHIQWMV	94	174	100	ORF1ab	266	21555	21290	NSP4

Table 2

Ranking procedure done on the basis of Geometric Mean of Binding and Antigenic Scores of T-cell and B-cell epitopes from each CCnR.

Consensus Conserved Region (CCnR)	Protein Sequence	Coded Protein	MHC-I restricted T-cell Immunogenic Score	MHC-I restricted T-cell Antigenic Score	MHC-II restricted T-cell Immunogenic Score	MHC-II restricted T-cell Antigenic Score	B-cell Epitopes Immunogenic Score	B-cell Epitopes Antigenic Score	Final Score
4012-CACAGAAAACTTGTTACTTTATATTGACATTAATGGCAATCTTCATCCAGATTCTGCCACTCTTGTTAGTGACATTGACATCACTTTCTTAAAGAAAGATGCTCCATATATAGTGGGTGATGTTGTTCAAGAGGGTGTTTTAACTGCTGTGGTTATACCTACTAAAAAGGCTGGTGGCACTACTGAAATGCTAGCGAAAGCTTT-4215	TENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKA	NSP3	0.8640	0.7361	0.9804	0.6382	0.8810	1	0.84
10463-TTAAGGGTTCATTCCTTAATGGTTCATGTGGTAGTGTTGGTTTTAACATAGATTATGACTGTGTCTCTTTTTGTTAC-10539	KGSFLNGSCGSVGFNIDYDCVSFCY	3CL-Proteinase	0.6552	0.9049	0.9114	0.7499	0.7143	0.7401	0.77
13291-TTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACACTTAAAAACACAGTCTGTACCGTCTGCGGTAT-13391	FCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCG	NSP10	0.9136	0.7542	0.9818	0.3852	0.9048	0.6813	0.74
5307-TAACACTCCAACAAATAGAGTTGAAGTTTAATCCACCTGCTCTACAAGATGCTTATTACAG-5367	TLQQIELKFNPPALQDAYY	NSP3	0.8106	1	0.9485	0.6714	0.3333	0.8433	0.72
9564-ATTCTTACCTGGTGTTTATTCTGTTATTTACTTGTACTTGACATTTTATCTTACTAATGATGTTTCTTTTTTAGCACATATTCAGTGGATGGTT-9657	FLPGVYSVIYLYLTFYLTNDVSFLAHIQWMV	NSP4	0.9980	0.7866	0.9933	0.3326	0.9762	0.4726	0.70
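The final scores in Table 2 can be checked numerically. The sketch below assumes that the final score of a CCnR is the geometric mean of its six scaled scores (immunogenic and antigenic, for MHC-I, MHC-II and B-cell epitopes); the function name `ccnr_rank_score` is hypothetical. Under this assumption the computed values agree with the final scores listed for the first two rows of the table.

```python
import math


def ccnr_rank_score(scores):
    """Geometric mean of the scaled scores of one CCnR (assumed ranking rule)."""
    if any(s < 0 for s in scores):
        raise ValueError("scaled scores are expected to lie in [0, 1]")
    return math.prod(scores) ** (1 / len(scores))


# Scaled scores from Table 2, in the order:
# MHC-I (IS, AS), MHC-II (IS, AS), B-cell (IS, AS).
nsp3 = [0.8640, 0.7361, 0.9804, 0.6382, 0.8810, 1.0]
cl3_proteinase = [0.6552, 0.9049, 0.9114, 0.7499, 0.7143, 0.7401]

for name, scores in [("NSP3", nsp3), ("3CL-Proteinase", cl3_proteinase)]:
    print(name, round(ccnr_rank_score(scores), 2))
```

Because a geometric mean multiplies the scores, a single low score (for example a weak antigenic score for one epitope class) pulls the rank down more strongly than it would under an arithmetic mean.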

It is important to note that although structural proteins are the popular candidates for vaccines, vaccine protection can also be correlated with non-structural proteins. In this regard, [47] showed that NS1, which is a non-structural protein, can bring about protective immunity against flaviviruses. Though no neutralizing effect was shown by antibodies against NS1, some exhibited complement-fixing activity, and even passive transfer of anti-NS1 antibody or immunization with NS1 can lead to protection against viruses [48]. Furthermore, anti-NS1 antibody could block NS1-induced pathogenic effects, reduce viral replication by complement-dependent cytotoxicity of infected cells and even attenuate NS1-induced disease development. This has led to NS1 being a prospective vaccine candidate against Dengue virus [49], [50]. Another core advantage of NS1 is that, being a non-structural protein, the anti-NS1 antibody will not instigate antibody-dependent enhancement (ADE), which is a virulence factor causing serious repercussions. Additionally, non-structural virus proteins can generate cytotoxic T lymphocytes, which are important to control infection. In [51], the authors have shown that the non-structural proteins of the hepatitis-C virus could generate HCV-specific broad-spectrum T-cell responses. Non-structural proteins have been used by [52] for vaccine design against Usutu Virus. Also, the non-structural proteins of HIV-1 were shown to be quite important as targets for prophylactic or therapeutic vaccines [53]. Moreover, Ong et al. [25] have predicted NSP3 in SARS-CoV-2 to produce high protective antigenicity. Thus, we can hypothesize that, apart from structural proteins, non-structural proteins of SARS-CoV-2 can also be possible targets for vaccine design, which may induce the cell-mediated or humoral immunity that is necessary to prevent viral invasion and/or replication.

Identification of MHC-I restricted T-cell epitopes

For epitope prediction from the 23 CCnRs, the associated protein sequences are used as inputs to the prediction tools. In this regard, MHC-I binding predictions are performed using the IEDB [54] recommended NetMHCpan EL 4.1 (published recently in September 2020) targeting 27 unique HLA alleles. For each CCnR, the 4 best-binding epitopes in terms of immunogenic score are selected, giving 92 epitopes of length 9–11 mer in total. Their antigenic scores are evaluated using VaxiJen 2.0 [55]. In order to rank the CCnRs, only the best immunogenic and antigenic MHC-I restricted T-cell epitopes are considered. As a consequence, 34 such epitopes are identified and reported in Supplementary Table S5 for all the CCnRs, while for the top 5 CCnRs, 8 epitopes are provided in Table 3. It is found that FLKKDAPYI and TAVVIPTKK are respectively the most immunogenic and the most antigenic MHC-I restricted T-cell epitopes from the NSP3 coded protein, bound to the HLA-A*31:01 and HLA-A*68:01 alleles respectively. All the 92 MHC-I restricted T-cell epitopes along with their HLA alleles are provided in the supplementary as an excel file.
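As a small illustration of the candidate space that a binding predictor scores, the sketch below enumerates all 9–11 mer windows of a CCnR protein sequence. It does not reproduce NetMHCpan or VaxiJen scoring; the function name `peptide_windows` is hypothetical, and the protein sequence is the top-ranked NSP3 CCnR reported in Table 1.

```python
def peptide_windows(protein: str, lengths=(9, 10, 11)):
    """All contiguous peptides of the given lengths, in sequence order."""
    return [
        protein[i:i + k]
        for k in lengths
        for i in range(len(protein) - k + 1)
    ]


# Protein sequence of the top-ranked CCnR (NSP3) from Table 1.
nsp3_ccnr = (
    "TENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKA"
)

candidates = peptide_windows(nsp3_ccnr)
print(len(candidates), "candidate peptides")
print("FLKKDAPYI" in candidates, "TAVVIPTKK" in candidates)
```

Both reported MHC-I epitopes for this region, FLKKDAPYI and TAVVIPTKK, are 9-mer windows of the CCnR protein sequence, so they appear among the enumerated candidates.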

Table 3

List of Immunogenic and Antigenic Epitopes for MHC-I, MHC-II restricted T-cell and B-cell Epitopes.

Protein Sequence	Coded Proteins	Type	MHC-I restricted T-cell Epitope	MHC-I Alleles	MHC-I Scaled Score of Immunogenicity	MHC-I Scaled Score of Antigenicity	MHC-II restricted T-cell Epitope	MHC-II Alleles	MHC-II Scaled Score of Immunogenicity	MHC-II Scaled Score of Antigenicity	B-cell Epitope	B-cell Scaled Score of Immunogenicity	B-cell Scaled Score of Antigenicity
TENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKA	NSP3	Immunogenic	FLKKDAPYI	HLA-A*31:01	0.8640	0.3890	ITFLKKDAPYIVGDV	HLA-DRB3*01:01	0.9804	0.3036	TLVSDIDITFLKKDAP	0.8810	0.7314
	NSP3	Immunogenic	FLKKDAPYI	HLA-A*31:01	0.8640	0.3890	ITFLKKDAPYIVGDV	HLA-DRB3*01:01	0.9804	0.3036	TLVSDIDITFLKKDAP	0.8810	0.7314

KGSFLNGSCGSVGFNIDYDCVSFCY	3CL-Proteinase	Immunogenic	FLNGSCGSV	HLA-A*02:03	0.6552	0.3342	CGSVGFNIDYDCVSF	HLA-DQA1*01:01/DQB1*05:01	0.9114	0.7499	CGSVGFNIDYDCVSFC	0.7143	0.7401
	3CL-Proteinase	Antigenic	FLNGSCGSV	HLA-A*02:03	0.6552	0.3342	CGSVGFNIDYDCVSF	HLA-DQA1*01:01/DQB1*05:01	0.9114	0.7499	CGSVGFNIDYDCVSFC	0.7143	0.7401

FCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCG	NSP10	Immunogenic	DLKGKYVQI	HLA-B*08:01	0.9136	0.7542	KGKYVQIPTTCANDP	HLA-DRB1*04:01	0.9818	0.1892	TTCANDPVGFTLKNTV	0.9048	0.6813
	NSP10	Immunogenic	DLKGKYVQI	HLA-B*08:01	0.9136	0.7542	KGKYVQIPTTCANDP	HLA-DRB1*04:01	0.9818	0.1892	TTCANDPVGFTLKNTV	0.9048	0.6813

TLQQIELKFNPPALQDAYY	NSP3	Immunogenic	NPPALQDAY	HLA-B*35:01	0.8106	0.4557	QIELKFNPPALQDAY	HLA-DRB3*02:02	0.9485	0.6409	LQQIELKFNPPALQDA	0.3333	0.8433
	NSP3	Immunogenic	NPPALQDAY	HLA-B*35:01	0.8106	0.4557	QIELKFNPPALQDAY	HLA-DRB3*02:02	0.9485	0.6409	LQQIELKFNPPALQDA	0.3333	0.8433

FLPGVYSVIYLYLTFYLTNDVSFLAHIQWMV	NSP4	Immunogenic	VSFLAHIQW	HLA-B*57:01	0.9980	0.7866	GVYSVIYLYLTFYLT	HLA-DPA1*01:03/DPB1*02:01	0.9933	0.3326	YSVIYLYLTFYLTNDV	0.9762	0.4726
	NSP4	Antigenic	VSFLAHIQW	HLA-B*57:01	0.9980	0.7866	GVYSVIYLYLTFYLT	HLA-DPA1*01:03/DPB1*02:01	0.9933	0.3326	YSVIYLYLTFYLTNDV	0.9762	0.4726

Identification of MHC-II restricted T-cell epitopes

Similar procedures are carried out for MHC-II restricted T-cell epitopes using the MHC-II binding prediction tool provided by IEDB with consensus prediction, targeting a different set of 27 unique HLA alleles. Subsequently, we obtained 92 epitopes of length 15–17 mer each, bound to their alleles, along with their corresponding immunogenic and antigenic scores. In order to rank the CCnRs, the best immunogenic and antigenic MHC-II restricted T-cell epitopes are considered, resulting in 37 epitopes which are reported in Supplementary Table S5 for all the CCnRs. The 8 epitopes for the top 5 CCnRs are reported in Table 3. From this table, it is seen that ITFLKKDAPYIVGDV and IDITFLKKDAPYIVG are respectively the most immunogenic and the most antigenic MHC-II restricted T-cell epitopes, corresponding to the HLA-DRB3*01:01 allele. All the 92 MHC-II restricted T-cell epitopes along with their HLA alleles are provided in the supplementary as an excel file.

Identification of B-cell epitopes

After obtaining the MHC-I and MHC-II restricted T-cell epitopes, B-cell epitopes, which are responsible for antibody production, are predicted using ABCPred [56] with a length of 15–18 mer, and their antigenic scores are evaluated with the VaxiJen server. As a result, 61 epitopes are found. In order to rank the CCnRs, the best immunogenic and antigenic B-cell epitopes are considered, which resulted in 29 epitopes. These epitopes are reported in Supplementary Table S5 for all the CCnRs, while for the top 5 CCnRs, 6 B-cell epitopes are reported in Table 3. In this table, it is found that TLVSDIDITFLKKDAP and LHPDSATLVSDIDITF are respectively the most immunogenic and the most antigenic B-cell epitopes. Here, it should be noted that for antigenicity evaluation, a threshold of 0.4 is maintained throughout the experiment, following the literature [20]. The graphical representation of TLVSDIDITFLKKDAP and LHPDSATLVSDIDITF is shown in Fig. 3 using BepiPred 2.0, where the green and yellow regions together represent the protein sequence TENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKA, while the two yellow regions denote the B-cell epitopes TLVSDIDITFLKKDAP and LHPDSATLVSDIDITF respectively. The red line in the figure represents the threshold, which is set to 0.5. For all the 23 CCnRs the results are shown in Supplementary Fig. S1, while the 61 B-cell epitopes are provided in the supplementary as an excel file.

Fig. 3

Graphical representation of B-cell epitopes for TLVSDIDITFLKKDAP and LHPDSATLVSDIDITF with the threshold marked by red line.
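The positions of the two B-cell epitopes within the quoted protein sequence can be checked with a few lines of Python. This is only a sketch for orientation: the helper name `locate` and the 1-based inclusive coordinate convention are assumptions made here and are not taken from BepiPred 2.0 or ABCPred.

```python
# Protein sequence and B-cell epitopes exactly as quoted in the text.
PROTEIN = "TENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKA"
B_CELL_EPITOPES = ["TLVSDIDITFLKKDAP", "LHPDSATLVSDIDITF"]


def locate(protein: str, epitope: str) -> tuple[int, int]:
    """Return the 1-based inclusive (start, end) of the first exact match."""
    idx = protein.find(epitope)
    if idx < 0:
        raise ValueError(f"{epitope} not found in protein sequence")
    return idx + 1, idx + len(epitope)


spans = {epitope: locate(PROTEIN, epitope) for epitope in B_CELL_EPITOPES}

# Number of residues shared by the two epitope spans (0 if they are disjoint).
(s1, e1), (s2, e2) = spans[B_CELL_EPITOPES[0]], spans[B_CELL_EPITOPES[1]]
overlap = max(0, min(e1, e2) - max(s1, s2) + 1)

for epitope, (start, end) in spans.items():
    print(f"{epitope}: residues {start}-{end}")
print("shared residues:", overlap)
```

Running the check shows that the two epitopes are not disjoint stretches of the sequence but share part of their residues.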

Final panel of epitopes

Table 4 summarises the final panel of the 34 MHC-I and 37 MHC-II restricted T-cell epitopes and 29 B-cell epitopes for the 23 CCnRs based on their highest immunogenic and antigenic scores. There are 16 unique HLA alleles for MHC-I and 19 unique HLA alleles for MHC-II restricted T-cell epitopes. The associated coded proteins for the 23 CCnRs are NSP1, NSP2, NSP3, NSP4, 3CL-Proteinase, NSP10, RNA-directed RNA polymerase, Helicase, Spike glycoprotein and Nucleocapsid protein. For better readability, the epitopes associated with the top 5 CCnRs are underlined in Fig. 4, whereas the epitopes for all 23 CCnRs are underlined in Supplementary Fig. S2. The red, green and blue lines denote the MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes respectively. Moreover, for the ease of the readers, all the details related to the 125 CCnRs, the 92 MHC-I and MHC-II restricted T-cell epitopes and the 61 B-cell epitopes are provided in the supplementary material as an Excel file, the link to which is given in Table S6. Additionally, MHC-I and MHC-II restricted T-cell and B-cell epitopes for SARS-CoV-2 collected from the literature [26], [27], [17], [28], [16], [15], [20], [24], [23], [29], [30], [31], [32], [33], [25] are reported in Table 5. Owing to space constraints, up to 3 MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes from each paper are listed in this table, while the complete list is given in the supplementary Excel file linked in Table S6. Thus, Table 4 and Table 5 give the readers a better insight into the epitopes identified so far.

Table 4

Overview of MHC-I, MHC-II restricted T-cell and B-cell epitopes for the 23 CCnRs.

Coded Proteins	Type	MHC-I restricted T-cell Epitopes	HLA Alleles (MHC-I)	MHC-II restricted T-cell Epitopes	HLA Alleles (MHC-II)	B-cell Epitopes
NSP3	Immunogenic	FLKKDAPYI	HLA-A*31:01	ITFLKKDAPYIVGDV	HLA-DRB3*01:01	TLVSDIDITFLKKDAP
NSP3	Antigenic	TAVVIPTKK	HLA-A*68:01	IDITFLKKDAPYIVG	HLA-DRB3*01:01	LHPDSATLVSDIDITF
3CL-Proteinase	Immunogenic	FLNGSCGSV	HLA-A*02:03	CGSVGFNIDYDCVSF	HLA-DQA1*01:01/DQB1*05:01	CGSVGFNIDYDCVSFC
3CL-Proteinase	Antigenic	GSVGFNIDY	HLA-A*30:02	CGSVGFNIDYDCVSF	HLA-DQA1*01:01/DQB1*05:01	CGSVGFNIDYDCVSFC

NSP10	Immunogenic	DLKGKYVQI	HLA-B*08:01	KGKYVQIPTTCANDP	HLA-DRB1*04:01	TTCANDPVGFTLKNTV
NSP10	Antigenic	DLKGKYVQI	HLA-B*08:01	DLKGKYVQIPTTCAN	HLA-DRB1*04:01	TTCANDPVGFTLKNTV

NSP3	Immunogenic	NPPALQDAY	HLA-B*35:01	QIELKFNPPALQDAY	HLA-DRB3*02:02	LQQIELKFNPPALQDA
NSP3	Antigenic	IELKFNPPAL	HLA-B*40:01	IELKFNPPALQDAYY	HLA-DRB3*02:02	LQQIELKFNPPALQDA

NSP4	Immunogenic	VSFLAHIQW	HLA-B*57:01	GVYSVIYLYLTFYLT	HLA-DPA1*01:03/DPB1*02:01	YSVIYLYLTFYLTNDV
NSP4	Antigenic	VSFLAHIQW	HLA-B*57:01	GVYSVIYLYLTFYLT	HLA-DPA1*01:03/DPB1*02:01	YSVIYLYLTFYLTNDV

NSP3	Immunogenic	QVNGLTSIKW	HLA-B*57:01	PQVNGLTSIKWADNN	HLA-DQA1*01:02/DQB1*06:02	KYPQVNGLTSIKWADN
NSP3	Antigenic	QVNGLTSIKW	HLA-B*57:01	KYPQVNGLTSIKWAD	HLA-DQA1*01:02/DQB1*06:02	KYPQVNGLTSIKWADN

Helicase	Immunogenic	RAQNMTMSY	HLA-A*30:02	YQLKLLIHHRAQNMT	HLA-DRB4*01:01	FWDYQLKLLIHHRAQN
Helicase	Antigenic	RAQNMTMSY	HLA-A*30:02	DYQLKLLIHHRAQNM	HLA-DRB4*01:02	IHHRAQNMTMSYSLKP

Spike glycoprotein	Immunogenic	HADQLTPTW	HLA-B*58:01	DIPIGAGICASYQTQ	HLA-DQA1*05:01/DQB1*03:01	GCLIGAEHVNNSYECD
Spike glycoprotein	Antigenic	HADQLTPTW	HLA-B*58:01	DIPIGAGICASYQTQ	HLA-DQA1*05:01/DQB1*03:01	GCLIGAEHVNNSYECD

NSP4	Immunogenic	ICISTKHFYW	HLA-B*57:01	KHFYWFFSNYLKRRV	HLA-DPA1*01:03/DPB1*04:01	ISTKHFYWFFSNYLKR
NSP4	Antigenic	ICISTKHFYW	HLA-B*57:01	TKHFYWFFSNYLKRR	HLA-DPA1*01:03/DPB1*04:01	ISTKHFYWFFSNYLKR

Nucleocapsid protein	Immunogenic	AQFAPSASAF	HLA-B*15:01	ATKAYNVTQAFGRR	HLA-DRB5*01:01	KSAAEASKKPRQKRTA
Nucleocapsid protein	Antigenic	AQFAPSASAF	HLA-B*15:01	KAYNVTQAFGRRGP	HLA-DRB5*01:01	GRRGPEQTQGNFGDQE

Spike glycoprotein	Immunogenic	FERDISTEI	HLA-B*40:01	VEGFNCYFPLQSYGF	HLA-DQA1*01:01/DQB1*05:01	GSTPCNGVEGFNCYFP
Spike glycoprotein	Antigenic	YFPLQSYGF	HLA-A*24:02	NGVEGFNCYFPLQSY	HLA-DRB3*01:01	EGFNCYFPLQSYGFQP

NSP4	Immunogenic	NVLEGSVAY	HLA-B*35:01	PVPYCYDTNVLEGSV	HLA-DRB1*04:01	SGKPVPYCYDTNVLEG
NSP4	Antigenic	SGKPVPYCY	HLA-A*30:02	GKPVPYCYDTNVLEG	HLA-DRB1*04:01	SGKPVPYCYDTNVLEG

Helicase	Immunogenic	VLAYVDHSY	HLA-B*15:01	VDHSYVVNAVTTMSY	HLA-DRB3*02:02	LAYVDHSYVVNAVTTM
Helicase	Antigenic	VLAYVDHSY	HLA-B*15:01	VDHSYVVNAVTTMSY	HLA-DRB3*02:02	LAYVDHSYVVNAVTTM

NSP3	Immunogenic	NYMPYFFTL	HLA-A*24:02	CTNYMPYFFTLLLQL	HLA-DPA1*03:01/DPB1*04:02	VCTNYMPYFFTLLLQL
NSP3	Antigenic	NYMPYFFTL	HLA-A*24:02	CTNYMPYFFTLLLQL	HLA-DPA1*03:01/DPB1*04:02	VCTNYMPYFFTLLLQL

NSP10	Immunogenic	FAVDAAKAY	HLA-B*35:01	LSFCAFAVDAAKAYK	HLA-DRB3*01:01	GTGQAITVTPEANMDQ
NSP10	Antigenic	VPANSTVLSF	HLA-B*35:01	LSFCAFAVDAAKAYK	HLA-DRB3*01:01	KMLCTHTGTGQAITVT

3CL-Proteinase	Immunogenic	GTTTLNGLW	HLA-B*57:01	TTTLNGLWLDDVVYC	HLA-DQA1*01:01/DQB1*05:01	QVTCGTTTLNGLWLDD
3CL-Proteinase	Antigenic	GTTTLNGLW	HLA-B*57:01	TLNGLWLDDVVYCPR	HLA-DQA1*01:01/DQB1*05:01	QVTCGTTTLNGLWLDD

NSP1	Immunogenic	HVGEIPVAY	HLA-B*15:01	VAYRKVLLRKNGNKG	HLA-DRB1*11:01	PHVGEIPVAYRKVLLR
NSP1	Antigenic	HVGEIPVAYR	HLA-A*68:01	IPVAYRKVLLRKNGN	HLA-DRB1*11:01	PHVGEIPVAYRKVLLR

NSP4	Immunogenic	RPDTRYVLM	HLA-B*07:02	LMDGSIIQFPNTYLE	HLA-DRB1*15:01	GSIIQFPNTYLEGSVR
NSP4	Antigenic	RPDTRYVLM	HLA-B*07:02	LMDGSIIQFPNTYLE	HLA-DRB1*15:01	LRPDTRYVLMDGSIIQ

NSP4	Immunogenic	VCVSTSGRW	HLA-B*57:01	TSGRWVLNNDYYRSL	HLA-DRB3*02:02	YCRHGTCERSEAGVCV
NSP4	Antigenic	VCVSTSGRW	HLA-B*57:01	STSGRWVLNNDYYRS	HLA-DRB3*02:02	YCRHGTCERSEAGVCV
RNA-directed RNA polymerase	Immunogenic	DTLSLTTNMK	HLA-A*68:01	TTNMKKQFIIYLRIV	HLA-DPA1*02:01/DPB1*05:01	LRDTLSLTTNMKKQFI
RNA-directed RNA polymerase	Antigenic	LSLTTNMKK	HLA-A*11:01	TTNMKKQFIIYLRIV	HLA-DPA1*02:01/DPB1*05:01	LRDTLSLTTNMKKQFI

NSP2	Immunogenic	VTHSKGLYR	HLA-A*31:01	ETFVTHSKGLYRKCV	HLA-DRB5*01:01	LNLGETFVTHSKGLYR
NSP2	Antigenic	VTHSKGLYRK	HLA-A*03:01	LGETFVTHSKGLYRK	HLA-DRB5*01:01	LNLGETFVTHSKGLYR

Spike glycoprotein	Immunogenic	VYYPDKVFR	HLA-A*31:01	TRGVYYPDKVFRSSV	HLA-DRB1*03:01	RGVYYPDKVFRSSVLH
Spike glycoprotein	Antigenic	GVYYPDKVFR	HLA-A*31:01	TRGVYYPDKVFRSSV	HLA-DRB1*03:01	RGVYYPDKVFRSSVLH

NSP2	Immunogenic	LEQPTSEAV	HLA-B*40:01	GDLQPLEQPTSEAVE	HLA-DQA1*03:01/DQB1*03:02	TGDLQPLEQPTSEAVE
NSP2	Antigenic	EVVLKTGDL	HLA-A*26:01	EVVLKTGDLQPLEQP	HLA-DRB1*08:02	TGDLQPLEQPTSEAVE

Fig. 4

MHC-I, MHC-II restricted T-cell and B-cell epitopes underlined in the protein sequences of top 5 CCnRs for (a) NSP3 (b) 3CL-Proteinase (c) NSP10 (d) NSP3 and (e) NSP4.

Table 5

List of proposed epitopes for SARS-CoV-2 as given in the literature.

Source	Coded Proteins	MHC-I restricted T-cell Epitopes	MHC-II restricted T-cell Epitopes	B-cell Epitopes
Bhattacharya et al. [26]	Spike glycoprotein	SQCVNLTTR	IHVSGTNGT	SQCVNLTTRTQLPPAYTNSFTRGVY
		YTNSFTRGV	VYYHKNNKS	FSNVTWFHAIHVSGTNGTKRFDN
		GVYYHKNNK	LVRDLPQGF	DPFLGVYYHKNNKSWME

Chen et al. [27]	Spike glycoprotein	LSPRWYFYY	IKLDDKDPN	EVRQIAPGQTGKIADY
		RSRNSSRNS	RSGARSKQR	GCLIGAEHVNNSYECD
		IGYYRRATR	RIGMEVTPS	FAMQMAYRFNGIGVTQ

Naz et al. [17]	Spike glycoprotein	GVYFASTEK	EFVFKNIDGYFKIYS	YNSASFSTFKCYGVSPTKLNDLCFT
		STQDLFLPF	QPYRVVVLSFELLHA
		KTSVDCTMY	MTKTSVDCTMYICGD

Kar et al. [28]	Spike glycoprotein	QIITTDNTF	INITRFQTLLALHRS	FSYTESLAGKREMAII
		YQPYRVVVL	GINITRFQTLLALHR	HAGPGPGPY
		FTISVTTEI	GWTFGAGAALQIPFA	KMGPGPGTRFA

Rakib et al. [16]	Spike glycoprotein	WTAGAAAYY	LIVNNATNV	RTQLPPAYTNS
		CNDPFLGVY	IVNNATNVV	SGTNGTKRFDN
		GAAAYYVGY	SKTQSLLIV	LTPGDSSSGWTAG

Vashi et al. [15]	Spike glycoprotein	RTQLPPAY	MFVFLVLLPLVSSQC	PPAYTNSFTRGVYY
		RTQLPPA	MFVFLVLLPLVSSQCVN	HVSGTNGTKRFDN
		LPPAYTNSF	QGNFKNLREFVFKNI	YYHKNNKSWMES

Yadav et al. [20]	Spike glycoprotein	GVYFASTEK	NA	HRSYLTPGDSSSGWTA
		FEYVSQPFL	NA	FPNITNLCPFGEVFNA
		WTAGAAAYY	NA	EVIQIAPGQTGKIADY

Crooke et al. [24]	Membrane glycoprotein	ATSRTLSYY	TLSYYKLGASQRVAG	EVTPSGTWL
		RLFARTRSM	RTLSYYKLGASQRVA	KLDDKDPNFK
		YANRNRFLY	ASFRLFARTRSMWSF	KTFPPTEPKKDKKKKADETQALPQ

Gupta et al. [23]	Spike glycoprotein	VRFPNITNL	NVTWFHAIHV	GDEVRQIAPGQTGKIADYNYKLP
		YQPYRVVVL
		PYRVVVLSF

Bhatnager et al. [29]	Spike glycoprotein	LTDEMIAQY	VASQSIIAYTMSLGA	KEEQIGKCSTR
		LLTDEMIAQY	LTDEMIAQYTSALLA	ELGKYEQYGPGPGKWP
		IPFAMQMAY	VLNDILSRLDKVEAE	IRAGPGPGGNC

Kwarteng et al. [30]	Nucleocapsid protein	KTFPPTEPK	AQFAPSASAFFGMSR	AGLPYGANK
		SSPDDQIGY	IAQFAPSASAFFGMS	SKQLQQSMSSADS
		SSPDDQIGYY	PQIAQFAPSASAFFG	RRIRGGDGKMKDL

Baruah et al. [31]	Spike glycoprotein	YLQPRTFLL	NA	CVNLTTRTQLPPAYTN
		GVYFASTEK		NVTWFHAIHVSGTNG
		EPVLKGVKL		SFSTFKCYGVSPTKLND

Bency et al. [32]	Spike glycoprotein	KIADYNYKL	VVFLHVTYV	MDLEGKQGNFKNL
		CYGVSPTKL	IGINITRFQ	YYVGYLQPR
		VVVLSFELL	FNCYFPLQS	NITNLCPFGE

Singh et al. [33]	Nucleocapsid protein	AQFAPSASA	AQFAPSASAFFGMSR	KEDLKFP
		GDAALALLL	GDAALALLLLDRLNQ	IKLDDKDPNFKDQ
		GMSRIGMEV	ASAFFGMSRIGMEVT	PPTEPKKDKKKKADETQALPQRQKKQQTVT

Ong et al. [25]	NSP3	STNVTIATY	ISNSWLMWLIINLVQ	EDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATS
		RMYIFFASF	LAYILFTRFFYVLGL	EEEQEEDWLDDD
		AEWFLAYIL	AAIMQLFFSYFAVHF	VGQQDGSEDNQ


Study of physico-chemical properties of epitopes

To judge the relevance of the epitopes found in this work, we have evaluated the physico-chemical properties of each selected epitope. The value of each physico-chemical property lies between 0 and 1. Table 6, Table 7 and Table 8 show the physico-chemical properties of the MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes respectively for the top 5 CCnRs, whereas for all the 23 CCnRs the results are reported in Supplementary Tables S7-S9 respectively. For example, in Table 6 the MHC-I restricted T-cell epitope FLKKDAPYI has a positively charged value of 0.222, a negatively charged value of 0.111, polarity of 0.111, non-polarity of 0.556, aliphaticity of 0.444, aromaticity of 0.222, acidicity of 0.111, basicity of 0.222, hydrophobicity of 0.556, hydrophilicity of 0.333, a neutral value of 0.111, a hydroxylic value of 0 and a sulphur content of 0. The physico-chemical properties of the other epitopes can be found in the tables in the same way.
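Several of these values can be read as the fraction of residues in the epitope that belong to a given residue class. The Python sketch below illustrates this for a subset of the properties. The residue groupings in `RESIDUE_CLASSES` are assumptions made here, chosen because they reproduce the corresponding Table 6 values; the definitions actually used in the study are not stated in this section, and properties such as hydrophobicity, hydrophilicity, polarity and neutrality are left out because their groupings cannot be inferred with confidence.

```python
# Assumed residue groupings (not taken from the paper); each property is the
# fraction of residues in the peptide that fall into the group.
RESIDUE_CLASSES = {
    "positively_charged": set("KRH"),
    "negatively_charged": set("DE"),
    "non_polar": set("GAVLIPFWM"),
    "aliphatic": set("GAVLIP"),
    "aromatic": set("FWY"),
    "hydroxylic": set("ST"),
    "sulphur": set("CM"),
}


def composition(peptide: str) -> dict[str, float]:
    """Return the fraction of residues in each assumed class, rounded to 3 places."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    return {
        name: round(sum(res in members for res in peptide) / len(peptide), 3)
        for name, members in RESIDUE_CLASSES.items()
    }


# Example with the top MHC-I epitope from Table 6.
props = composition("FLKKDAPYI")
for name, value in props.items():
    print(f"{name}: {value}")
```

Because every property is a fraction of the peptide length, each value necessarily lies between 0 and 1, which agrees with the range stated above.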

Table 6

List of physico-chemical properties of MHC-I restricted T-cell epitopes.

MHC-I restricted T-cell epitopes	Positively charged	Negatively charged	Polarity	Non Polarity	Aliphaticity	Aromaticity	Acidicity	Basicity	Hydrophobicity	Hydrophilicity	Neutral	Hydroxylic	Sulphur Content
FLKKDAPYI	0.222	0.111	0.111	0.556	0.444	0.222	0.111	0.222	0.556	0.333	0.111	0	0
TAVVIPTKK	0.222	0	0.222	0.556	0.556	0	0	0.222	0.778	0.333	0.222	0.222	0
FLNGSCGSV	0	0	0.333	0.556	0.444	0.111	0	0	0.444	0.111	0.444	0.222	0.111
GSVGFNIDY	0	0.111	0.222	0.556	0.444	0.222	0.111	0	0.333	0.111	0.444	0.111	0
DLKGKYVQI	0.222	0.111	0.222	0.444	0.444	0.111	0.111	0.222	0.333	0.222	0.333	0	0
NPPALQDAY	0	0.111	0.222	0.556	0.556	0.111	0.111	0	0.556	0.333	0.222	0	0
IELKFNPPAL	0.1	0.1	0	0.7	0.6	0.1	0.1	0.1	0.7	0.4	0.1	0	0
VSFLAHIQW	0.111	0	0.222	0.667	0.444	0.222	0	0.111	0.667	0.111	0.222	0.111	0

Table 7

List of physico-chemical properties of MHC-II restricted T-cell epitopes.

MHC-II restricted T-cell epitopes	Positively charged	Negatively charged	Polarity	Non Polarity	Aliphaticity	Aromaticity	Acidicity	Basicity	Hydrophobicity	Hydrophilicity	Neutral	Hydroxylic	Sulphur Content
ITFLKKDAPYIVGDV	0.133	0.133	0.133	0.6	0.533	0.133	0.133	0.133	0.6	0.2	0.267	0.067	0
IDITFLKKDAPYIVG	0.133	0.133	0.133	0.6	0.533	0.133	0.133	0.133	0.6	0.2	0.267	0.067	0
CGSVGFNIDYDCVSF	0	0.133	0.333	0.467	0.333	0.2	0.133	0	0.467	0.067	0.4	0.133	0.133
KGKYVQIPTTCANDP	0.133	0.067	0.333	0.4	0.4	0.067	0.067	0.133	0.533	0.333	0.333	0.133	0.067
DLKGKYVQIPTTCAN	0.133	0.067	0.333	0.4	0.4	0.067	0.067	0.133	0.533	0.267	0.333	0.133	0.067
QIELKFNPPALQDAY	0.067	0.133	0.2	0.533	0.467	0.133	0.133	0.067	0.533	0.267	0.267	0	0
IELKFNPPALQDAYY	0.067	0.133	0.2	0.533	0.467	0.2	0.133	0.067	0.533	0.267	0.2	0	0
GVYSVIYLYLTFYLT	0	0	0.467	0.533	0.467	0.333	0	0	0.6	0	0.267	0.2	0

Table 8

List of physico-chemical properties of B-cell epitopes.

B-cell epitopes	Positively charged	Negatively charged	Polarity	Non Polarity	Aliphaticity	Aromaticity	Acidicity	Basicity	Hydrophobicity	Hydrophilicity	Neutral	Hydroxylic	Sulphur Content
TLVSDIDITFLKKDAP	0.125	0.188	0.188	0.500	0.438	0.062	0.188	0.125	0.625	0.188	0.375	0.188	0
LHPDSATLVSDIDITF	0.062	0.188	0.250	0.500	0.438	0.062	0.188	0.062	0.625	0.125	0.438	0.250	0
CGSVGFNIDYDCVSFC	0	0.125	0.375	0.438	0.312	0.188	0.125	0	0.500	0.062	0.375	0.125	0.188
TTCANDPVGFTLKNTV	0.062	0.062	0.312	0.438	0.375	0.062	0.062	0.062	0.688	0.250	0.375	0.250	0.062
LQQIELKFNPPALQDA	0.062	0.125	0.188	0.562	0.500	0.062	0.125	0.062	0.562	0.250	0.312	0	0
YSVIYLYLTFYLTNDV	0	0.062	0.438	0.438	0.375	0.312	0.062	0	0.562	0.062	0.25	0.188	0


Study of docking with Ramachandran plot and Z-score

To further validate the identified epitopes, 2D representations of the non-covalent interactions of the identified MHC-I and MHC-II restricted T-cell epitopes are studied using LigPlot+. For the highly immunogenic and antigenic epitopes of each CCnR, molecular docking is computed using Autodock Vina in order to extract the stable binding conformation of each predicted epitope-allele pair. For MHC-I restricted T-cell epitopes, 12 binding scores are generated from Autodock Vina, while for MHC-II 9 binding scores are generated. For some epitopes, the docking structures could not be generated because the corresponding structures of the HLA alleles are unavailable. Furthermore, the Ramachandran plot and Z-score are evaluated for further validation using PyMod 3 and the ProSA server respectively. The results of docking along with the Z-scores are reported in Table 9. The results for FLKKDAPYI and TAVVIPTKK, the most immunogenic and most antigenic MHC-I restricted T-cell epitopes, are shown in Fig. 5 and Fig. 6, while those for ITFLKKDAPYIVGDV and IDITFLKKDAPYIVG, the most immunogenic and most antigenic MHC-II restricted T-cell epitopes, are shown in Fig. 7 and Fig. 8 respectively. In these four figures, (a) shows the 2D binding pose of the epitope with its allele, (b) shows the binding position of the epitope in the binding groove of the allele obtained from Autodock Vina, with docking scores of −8.2 and −8.1 for MHC-I and −9 and −8.8 for MHC-II for the immunogenic and antigenic epitopes respectively, and (c) depicts the surface interaction between the allele and the identified epitope, showing the fitting sites in the binding groove. Further, the quality of the residues is evaluated on the basis of the rotation around the bonds within them. This is depicted in (d) of Fig. 5 and Fig. 6 for MHC-I and of Fig. 7 and Fig. 8 for MHC-II through the Ramachandran plot, in which points lying in the red region represent more stable bond orientations inside the molecule. This is followed by the Z-score evaluation in (e), where the negative Z-scores, which are −9.81 and −5.9 for MHC-I and −5.53 and −5.59 for MHC-II as shown in Table 9 and in Fig. 5, Fig. 6, Fig. 7 and Fig. 8, verify the stability of the structures, and (f) shows the overall negative energy values of the residues across the whole structures, which confirm the molecular stability of the identified epitopes. The results of docking along with the Z-scores for all the 23 CCnRs are reported in Supplementary Table S10, while the corresponding structural analyses are given in Supplementary Figs. S3 and S4.

Table 9

Docking and Z-scores of MHC-I and MHC-II restricted T-cell epitopes for the top 5 ranked CCnRs.

MHC-I restricted T-cell epitopes	Score from Autodock Vina	Z Score	MHC-II restricted T-cell epitopes	Score from Autodock Vina	Z Score
FLKKDAPYI	−8.2	−9.81	ITFLKKDAPYIVGDV	−9	−5.53
TAVVIPTKK	−8.1	−5.9	IDITFLKKDAPYIVG	−8.8	−5.59
FLNGSCGSV	Not Generated	Not Generated	CGSVGFNIDYDCVSF	Not Generated	Not Generated
GSVGFNIDY	−7.1	−5.4		Not Generated	Not Generated
DLKGKYVQI	−8.1	−8.81	KGKYVQIPTTCANDP	Not Generated	Not Generated
DLKGKYVQI	−8.1	−8.81	DLKGKYVQIPTTCAN	Not Generated	Not Generated
NPPALQDAY	Not Generated	Not Generated	QIELKFNPPALQDAY	Not Generated	Not Generated
IELKFNPPAL	Not Generated	Not Generated	IELKFNPPALQDAYY	Not Generated	Not Generated
VSFLAHIQW	−8.8	−9.26	GVYSVIYLYLTFYLT	−8	−5.02
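The MHC-I columns of Table 9 can be ordered by docking score with a short Python sketch such as the one below. It assumes the usual reading of Autodock Vina scores, in which a more negative value indicates stronger predicted binding, and it skips the "Not Generated" entries. The helper names `parse_score` and `rank_by_docking` are hypothetical, and the values are copied from Table 9, including the typographic minus sign, which has to be normalised before conversion.

```python
from typing import Optional

# MHC-I columns of Table 9: (epitope, Autodock Vina score, Z-score).
RAW_ROWS = [
    ("FLKKDAPYI", "−8.2", "−9.81"),
    ("TAVVIPTKK", "−8.1", "−5.9"),
    ("FLNGSCGSV", "Not Generated", "Not Generated"),
    ("GSVGFNIDY", "−7.1", "−5.4"),
    ("DLKGKYVQI", "−8.1", "−8.81"),
    ("DLKGKYVQI", "−8.1", "−8.81"),
    ("NPPALQDAY", "Not Generated", "Not Generated"),
    ("IELKFNPPAL", "Not Generated", "Not Generated"),
    ("VSFLAHIQW", "−8.8", "−9.26"),
]


def parse_score(cell: str) -> Optional[float]:
    """Convert a table cell to float; return None for 'Not Generated'."""
    if cell.strip().lower() == "not generated":
        return None
    # Normalise the Unicode minus sign (and en dash) to an ASCII hyphen.
    return float(cell.replace("\u2212", "-").replace("\u2013", "-"))


def rank_by_docking(rows):
    """Deduplicate epitopes and sort them from most to least negative Vina score."""
    scored = {}
    for epitope, vina, z_score in rows:
        score = parse_score(vina)
        if score is not None:
            scored[epitope] = (score, parse_score(z_score))
    return sorted(scored.items(), key=lambda item: item[1][0])


ranked = rank_by_docking(RAW_ROWS)
for epitope, (vina, z_score) in ranked:
    print(f"{epitope}\t{vina}\t{z_score}")
```

Ties in the docking score keep the order in which the epitopes appear in the table, since Python's sort is stable.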

Fig. 5

Structural analysis for the highly immunogenic MHC-I restricted T-cell epitope “FLKKDAPYI” for NSP3 coded protein (a) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (b) Docking structure of MHC-I restricted T-cell epitope (c) The surface interaction between the allele and epitopes showing the fitting sites in binding grooves (d) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frame (e) Z-score plot and (f) all residue energy.

Fig. 6

Structural analysis for the highly antigenic MHC-I restricted T-cell epitope “TAVVIPTKK” for NSP3 coded protein (a) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (b) Docking structure of MHC-I restricted T-cell epitope (c) The surface interaction between the allele and epitopes showing the fitting sites in binding grooves (d) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frame (e) Z-score plot and (f) all residue energy.

Fig. 7

Structural analysis for the highly immunogenic MHC-II restricted T-cell epitope “ITFLKKDAPYIVGDV” for NSP3 coded protein (a) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (b) Docking structure of MHC-II restricted T-cell epitope (c) The surface interaction between the allele and epitopes showing the fitting sites in binding grooves (d) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frame (e) Z-score plot and (f) all residue energy.

Fig. 8

Structural analysis for the highly antigenic MHC-II restricted T-cell epitope “IDITFLKKDAPYIVG” for NSP3 coded protein (a) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (b) Docking structure of MHC-II restricted T-cell epitope (c) The surface interaction between the allele and epitopes showing the fitting sites in binding grooves (d) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frame (e) Z-score plot and (f) all residue energy.

Due to the worldwide pandemic caused by SARS-CoV-2, development of safe and effective vaccines is the need of the hour. This study has identified T-cell and B-cell epitopes using computational methods which can be used for probable vaccine design. The main advantages of this work can be summarised as (a) whole genome analysis of 566 Indian SARS-CoV-2 genomes in order to consider the genetic mutations to understand and target the virus proteins, (b) finding consensus conserved regions from four alignment techniques viz. ClustalW, MUSCLE, ClustalO and MAFFT and (c) using latest tools like NetMHCpan EL 4.1 (published in September 2020), PyMod 3 and BepiPred 2.0 for computational purposes. Furthermore, we have used our own developed tool ABCpred to predict the B-cell epitopes.

Conclusion

In this work, genome-wide analysis of 566 Indian SARS-CoV-2 genomes has been performed to extract the potential conserved regions, showing high immunogenicity and antigenicity, for epitope-based synthetic vaccine design. In this regard, 125 CCnRs have been identified after extracting the conserved regions from the aligned sequences of the four multiple sequence alignment techniques. These CCnRs are then filtered based on three major criteria: a length greater than or equal to 60nt, no stop codons in the corresponding proteins, and a BLAST specificity score, measured as query coverage, equal to 100%. Such filtering resulted in 23 CCnRs covering NSP1, NSP2, NSP3, NSP4, 3CL-Proteinase, NSP10, RNA-directed RNA polymerase, Helicase, Spike glycoprotein and Nucleocapsid protein. These CCnRs are then ranked based on their immunogenic and antigenic scores to identify the MHC-I and MHC-II restricted T-cell and B-cell epitopes. This ranking resulted in 34 MHC-I and 37 MHC-II restricted T-cell epitopes with 16 and 19 unique HLA alleles respectively, and 29 B-cell epitopes for the 23 CCnRs. The ranking identified the CCnR from the NSP3 coded protein as highly immunogenic and antigenic, providing the MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes FLKKDAPYI, ITFLKKDAPYIVGDV and TLVSDIDITFLKKDAP as the most immunogenic and TAVVIPTKK, IDITFLKKDAPYIVG and LHPDSATLVSDIDITF as the most antigenic respectively. These epitopes can be considered for the design of synthetic vaccines. Furthermore, to validate the relevance of these epitopes, their binding conformation and physico-chemical properties are also shown with respect to HLA alleles. This study thus provides potential MHC-I and MHC-II restricted T-cell and B-cell epitopes for designing epitope-based synthetic vaccines.
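The three filtering criteria summarised above can be expressed as a simple predicate. The following Python sketch is only an illustration under stated assumptions: the function names, the choice of reading frame 0 and the synthetic test sequences are hypothetical, the standard stop codons TAA, TAG and TGA are assumed, and the query coverage is taken as an already computed percentage rather than obtained from Nucleotide BLAST.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def has_stop_codon(nt: str, frame: int = 0) -> bool:
    """Check for an in-frame stop codon, reading complete codons from `frame`."""
    nt = nt.upper().replace("U", "T")
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


def passes_filters(nt: str, query_coverage: float,
                   frame: int = 0, min_len: int = 60) -> bool:
    """Apply the three criteria: length >= 60 nt, no stop codon, 100% coverage."""
    return (
        len(nt) >= min_len
        and not has_stop_codon(nt, frame)
        and query_coverage == 100.0
    )


# Synthetic, purely illustrative sequences (not real CCnRs).
ok_region = "GCT" * 20              # 60 nt, no stop codon
short_region = "GCT" * 19           # 57 nt, too short
stop_region = "GCT" * 19 + "TAA"    # 60 nt, ends with an in-frame stop codon

results = {
    "ok": passes_filters(ok_region, 100.0),
    "short": passes_filters(short_region, 100.0),
    "stop": passes_filters(stop_region, 100.0),
    "low_coverage": passes_filters(ok_region, 99.0),
}
print(results)
```

Only a region that satisfies all three conditions at once is retained, which is how the 125 CCnRs are narrowed down to the 23 that are ranked.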

Ethics approval and consent to participate

Ethical approval and individual consent were not applicable.

Availability of data and materials

The aligned 566 Indian SARS-CoV-2 genomes with the reference and consensus sequences, as well as the final results of this work, are available at “http://www.nitttrkol.ac.in/indrajit/projects/COVID-EpitopeVaccine-India/”. Moreover, the Indian SARS-CoV-2 genomes used in this work are publicly available in the GISAID database.

Consent for publication

Not applicable.

Funding

This work has been partially supported by CRG short term research grant on COVID-19 (CVD/2020/000991) from Science and Engineering Research Board (SERB), Department of Science and Technology, Govt. of India.

Author contributions

Nimisha Ghosh: Formal analysis; Methodology, Coding; Visualization; Writing - original draft & editing, Nikhil Sharma: Methodology; Coding; Visualization; Writing - review & editing, Indrajit Saha: Conceptualization; Data curation; Supervision; Funding acquisition; Formal analysis; Investigation; Methodology; Project administration; Resources; Validation; Visualization; Writing - review & editing, Sudipto Saha: Conceptualization; Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no conflict of interest.
