
Bioinformatics pipeline unveils genetic variability to synthetic vaccine design for Indian SARS-CoV-2 genomes.

Nimisha Ghosh1, Indrajit Saha2, Nikhil Sharma3, Suman Nandi4.   

Abstract

In the worrisome scenarios of various waves of SARS-CoV-2 pandemic, a comprehensive bioinformatics pipeline is essential to analyse the virus genomes in order to understand its evolution, thereby identifying mutations as signature SNPs, conserved regions and subsequently to design epitope based synthetic vaccine. We have thus performed multiple sequence alignment of 4996 Indian SARS-CoV-2 genomes as a case study using MAFFT followed by phylogenetic analysis using Nextstrain to identify virus clades. Furthermore, based on the entropy of each genomic coordinate of the aligned sequences, conserved regions are identified. After refinement of the conserved regions, based on its length, one conserved region is identified for which the primers and probes are reported for virus detection. The refined conserved regions are also used to identify T-cell and B-cell epitopes along with their immunogenic and antigenic scores. Such scores are used for selecting the most immunogenic and antigenic epitopes. By executing this pipeline, 40 unique signature SNPs are identified resulting in 23 non-synonymous signature SNPs which provide 28 amino acid changes in protein. On the other hand, 12 conserved regions are selected based on refinement criteria out of which one is selected as the potential target for virus detection. Additionally, 22 MHC-I and 21 MHC-II restricted T-cell epitopes with 10 unique HLA alleles each and 17 B-cell epitopes are obtained for 12 conserved regions. All the results are validated both quantitatively and qualitatively which show that from genetic variability to synthetic vaccine design, the proposed pipeline can be used effectively to combat SARS-CoV-2.
Copyright © 2022. Published by Elsevier B.V.

Keywords:  Bioinformatics Pipeline; Clade; Conserved Regions; Non-synonymous signature SNP; SARS-CoV-2; T-cell epitopes

Year:  2022        PMID: 36116149      PMCID: PMC9444899          DOI: 10.1016/j.intimp.2022.109224

Source DB:  PubMed          Journal:  Int Immunopharmacol        ISSN: 1567-5769            Impact factor:   5.714


Introduction

More than two years ago, SARS-CoV-2 brought human movement to a massive halt because of its high transmission rate [1]. An early study established that SARS-CoV-2 is highly similar to SARS-CoV-1 (95%–100%) [2]. In April 2021, India registered its second sudden surge in official cases, driven by the Delta (B.1.617.2) variant. In late 2021, a third wave led by Omicron hit the country. Although India is pursuing a very large vaccination drive, concerns over the efficacy of the vaccines against such aggressive mutations are also increasing. India is not the only country to have witnessed new variants of the evolving virus: variants reported in South Africa (501Y.V2) [3], the United Kingdom (B.1.1.7) [4], Japan (E484K) [5] and Brazil (P.1) [5] are also circulating. The latest variant to emerge is Omicron (B.1.1.529). Although it was previously suggested that such mutants would not affect the effectiveness of the vaccines currently in use, the emergence of Omicron has changed the equation. Moreover, new variants can affect diagnostic procedures such as primer identification or antibody binding in RT-PCR. Ascoli [6] also suggested that a mutation in the Spike region of SARS-CoV-2 may have the greatest impact on diagnosis, along with increasing infection rates and transmissibility or even affecting younger people. In the current scenario, studying the frequently occurring mutations within the virus is an important and urgent task. In this regard, Yuan et al. [7] analysed 11,183 SARS-CoV-2 genomes from around the globe to identify SNPs and critical SNPs with high mutation frequency, along with an analysis of their geographical patterns. They found 74 non-synonymous and 43 synonymous mutations. Most importantly, they identified Nucleocapsid (N) as the gene with the highest mutational frequency changes.
This directly undermines the claim of Ascoli [6] that Nucleocapsid can be targeted for diagnostic purposes because the N gene undergoes very few mutations or is mostly conserved. Hence, it is important to take a closer look at how SARS-CoV-2 is evolving over time. Moreover, Tang et al. [8] found newly developing variations on the receptor binding sites of the Spike gene of SARS-CoV-2 in the form of S and L lineages. Here, the S and L lineages are defined by two tightly linked SNPs at positions 8,782 (orf1ab: T8517C, synonymous) and 28,144 (ORF8: C251T, S84L), which might affect virus pathogenesis. Phylogenetic analysis by Maitra et al. [9] revealed signature mutations such as C14408T in RdRp along with the A23403G change in the Spike protein, which mainly form the A2a clade, within 9 Indian sequences. They also reported a triplet-based mutation in the N gene, 28881–3 GGG/AAC, which might affect miRNA binding to the original sequences. Genome analysis by Saha et al. [10] for 72 different countries showed multiple unique mutation points in the form of substitutions, deletions, insertions and SNPs in each country, resulting in 7209, 11700, 119 and 53 mutations respectively. They also identified 11 SNPs that are unique to India, the most frequent being the T1198K, A97V, T315N and P13L mutation points in NSP3, RdRp, Spike and ORF8. Therefore, it has become more important than ever to constantly monitor the continuously evolving virus in order to take proper measures against it. A study conducted by Nagy et al. [11] identified genomic alterations and the association between each mutation and patient outcome. They found 3733 mutation points related to mild outcome in the ORF8, NSP6, ORF3a, NSP4 and Nucleocapsid genes, whereas mutations in Spike glycoprotein, RNA polymerase, ORF3a, NSP3, ORF6 and N led to inferior outcome. Severe outcomes were also associated with mutations in the ORF3a and NSP7 proteins.
Thus, mutations in significant genes such as Spike and N are important, and such mutations may even lead to a false diagnosis in RT-PCR testing. Hence, it is also important to extract the conserved regions of a genomic sequence for more effective diagnosis. In this regard, Saha et al. [10] identified a conserved region in the NSP6 gene as a potential target for SARS-CoV-2 detection using RT-PCR. On the other hand, alterations in an RNA virus can lead to vaccine failures, as was noticed in the case of the Influenza virus in 2013–14 [12]. Hence, to fight a highly evolving virus like SARS-CoV-2, it is important to have a stable vaccine. In this regard, Ghosh et al. [13] performed a genome-wide analysis of 10644 SARS-CoV-2 sequences to identify the conserved regions in the virus genome, and then proposed an epitope based vaccine design targeting T-cell and B-cell epitopes. Another study by Ghosh et al. [14] identified conserved regions specifically in 566 Indian SARS-CoV-2 sequences by considering four different multiple sequence alignment techniques. In both studies, the most immunogenic and antigenic epitopes were derived from various coded proteins of the virus, which can be targeted for synthetic vaccine design. Alam et al. [15] targeted the Spike glycoprotein to propose a non-allergenic, highly antigenic and non-mutant synthetic vaccine design targeting thymus cell (T-cell) and bone marrow cell (B-cell) epitopes. Rahman et al. [16] targeted three important genes, viz. Spike, Membrane and Envelope, for a multi-epitope-based vaccine design for SARS-CoV-2 with 90% population coverage. Their immune simulation also suggested a significant increase in the primary immune response with increased IgM, and in the secondary immune response with increased IgG1 and IgG2, along with increased proliferation of T-helper cells and increased cytokines.
Another study [17] targeted heptad repeats 1 and 2 (HR1 and HR2) in the Spike protein for peptide design using molecular dynamics simulation between the fusion of the viral membrane with the host cell membrane. This eventually limited the spread of the virus in the host cells. Vashi et al. [18] predicted 24 potential epitope fragments of which 20 were on the surface of Spike protein (S protein) and were considered to be helpful for designing potential immunogenic peptide based vaccines. Motivated by the literature and looking at the sudden surge of SARS-CoV-2 in India, a comprehensive bioinformatics pipeline is proposed in this work to analyse the virus genomes for understanding its evolution for identifying mutations as signature SNPs, conserved regions and subsequently to design epitope based synthetic vaccine. In this regard, we have performed multiple sequence alignment of 4996 Indian SARS-CoV-2 sequences as a case study using MAFFT followed by phylogenetic analysis of the aligned sequences using Nextstrain. As a result, the sequences are found to be distributed in 5 clades, viz 19A, 19B, 20A, 20B and 20C. Thereafter, from the aligned sequences, mutation points as SNPs are identified in each clade. Subsequently, top 10 signature SNPs based on their frequency are identified in each clade resulting in a total of 50 such SNPs. Out of 50 signature SNPs, 40 unique signature SNPs are identified resulting in 23 non-synonymous signature SNPs which gives 28 amino acid changes in protein which are visualised in protein structures as well. Furthermore, the sequence and structural homology-based prediction along with the protein structural stability of the amino acid changes for such SNPs are evaluated using PROVEAN, PolyPhen 2.0 and I-Mutant 2.0 in order to judge the characteristics of the identified clades. 
As a consequence, A97V in RdRp in 19A, V354L in RdRp in 19B, Q57H in ORF3a in 20A, R203M in Nucleocapsid in 20B, and T85I in NSP2 and Q57H in ORF3a in 20C are the unique amino acid changes responsible for defining each clade, as they are all deleterious and decrease the protein structural stability. Moreover, based on the entropy of each genomic coordinate of the aligned sequences, conserved regions are identified. Conserved regions are stretches of the genomic sequences for which the corresponding protein sequences remain unchanged. These conserved regions are then filtered on the criteria that their lengths are greater than or equal to 125nt and their BLAST specificity score is equal to 100%, resulting in 12 conserved regions belonging to the NSP2, NSP8, NSP10, RdRp, Exon, Spike glycoprotein, ORF3a and ORF7a proteins. Based on its length, one conserved region in the NSP10 gene is identified as a potential target, for which the primers and probes are reported as well. Such primers and probes can be used for detecting the SARS-CoV-2 virus. The 12 conserved regions are also used to identify T-cell and B-cell epitopes along with their immunogenic and antigenic scores. Using these scores, the most immunogenic and antigenic epitopes are selected for the 12 conserved regions, thereby identifying 23 MHC-I and 22 MHC-II restricted T-cell epitopes with 10 unique HLA alleles each and 17 B-cell epitopes. Finally, the binding conformations of the MHC-I and MHC-II restricted T-cell epitopes with respect to HLA alleles are shown to judge their relevance. The physico-chemical properties of the epitopes are also reported, along with structural properties using Ramachandran plots, ERRAT scores and Z-Scores.
Thus, based on the comprehensive bioinformatics pipeline, the main contributions of this work can be summarised as: (a) phylogenetic analysis in Nextstrain to identify virus clades, (b) identification of SNPs in the aligned sequences, (c) based on frequency, top 10 signature SNPs identification in each virus clade, (d) identification of conserved regions and based on length selecting one such region as potential target for reporting the corresponding primers and probes to detect SARS-CoV-2 and (e) identification of T-cell and B-cell epitopes for peptide based synthetic vaccine design.

Material and Methods

In this section, the details of data collection and the preparation are elucidated which is followed by a brief discussion on the pipeline of the workflow that has been considered in this work.

Data Collection and Preparation

The reference sequence of the SARS-CoV-2 virus (NC_045512.2) is collected from the National Center for Biotechnology Information (NCBI), while 4996 complete or near complete Indian SARS-CoV-2 genomes are collected from the Global Initiative on Sharing All Influenza Data (GISAID) in fasta format. The 4996 SARS-CoV-2 sequences are mostly distributed from January 2020 to January 2021. These sequences are then aligned to find the conserved regions. The coded protein corresponding to each conserved region is extracted as well. Further, to map the protein sequences and the changes in amino acids, protein PDB files are collected from Zhang Lab and are then used to model and identify the structural changes. All these analyses are executed on the High Performance Computing (HPC) facility of NITTTR, Kolkata, while the amino acid changes are checked in MATLAB R2019b. The HPC cluster has a master node with dual Intel Xeon Gold 6130 processors (32 cores, 2.10 GHz, 22 MB L3 cache) and 128 GB DDR4 RAM, and 2 GPU and 4 CPU computing nodes, each with dual Intel Xeon Gold 6152 processors (44 cores, 2.1 GHz, 30 MB L3 cache) and 192 GB DDR4 RAM; the GPU nodes have NVIDIA Tesla V100 GPUs with 16 GB memory each. MSA is performed using the 2 GPU and 4 CPU computing nodes.

Pipeline of the work

The pipeline of this work is provided in Fig. 1. In this work, a comprehensive bioinformatics pipeline is proposed which encompasses identification of mutation points as SNPs, identification of conserved regions and finally the design of an epitope based synthetic vaccine. To achieve these goals, in the first phase of the pipeline, multiple sequence alignment of 4996 Indian SARS-CoV-2 genomes as a case study is carried out using MAFFT [19], followed by phylogenetic analysis using Nextstrain [20]. MAFFT uses the fast Fourier transform to rapidly identify homologous regions, which makes it well suited to aligning thousands of genomes. Nextstrain, on the other hand, analyses the evolution and spread of pathogens by considering phylogenomic and phylogeographic data. The spread and evolution of virus genomes can be visualised at nextstrain.org using auspice. Using this tool, the evolution and geographic distribution of SARS-CoV-2 genomes are visualised by creating the metadata in our High Performance Computing environment. Once the virus clades are identified using Nextstrain, clade specific aligned sequences are used to identify mutation points as substitutions, especially SNPs, in each clade. Thereafter, a codon table is used to identify the amino acid changes in the virus proteins corresponding to the SNPs. Then, based on their frequency in the virus genomes, the top 10 signature SNPs are identified in each clade. Please note that the SNPs can be either synonymous or non-synonymous with respect to the amino acid they encode. Furthermore, the amino acid changes of the non-synonymous SNPs are visualised in the protein structures and are used to evaluate their functional characteristics as well.
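The codon-table step can be illustrated with a minimal Python sketch. It is not the authors' implementation (they report checking amino acid changes in MATLAB); the function names `codon_location` and `classify_snp` are hypothetical, the standard genetic code is assumed, and the reference codons `GAT` and `TTC` are supplied by hand as assumptions rather than read from the reference genome. The Spike start coordinate 21563 is the one listed for the Spike coding region in Table 3.

```python
from itertools import product

# Standard genetic code, built in TCAG order (assumed; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_location(genomic_pos, cds_start):
    """Return (1-based codon number, 0-based offset inside the codon)."""
    offset = genomic_pos - cds_start
    return offset // 3 + 1, offset % 3


def classify_snp(ref_codon, offset, alt_base):
    """Return (reference amino acid, altered amino acid, label) for one substitution."""
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, label


# A23403G in Spike (coding region starts at 21563); reference codon assumed to be GAT.
codon, offset = codon_location(23403, 21563)
print(codon, offset, classify_snp("GAT", offset, "G"))
# prints: 614 1 ('D', 'G', 'non-synonymous')
```

Under these assumptions the sketch reproduces the D>G change at protein coordinate 614 reported for A23403G, while a third-base change such as `classify_snp("TTC", 2, "T")` is labelled synonymous.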
Fig. 1

Pipeline of the work.

The second phase of the pipeline entails identification of Conserved Regions (CnRs) in the aligned sequences using entropy ($E$), which can be computed as:

$$E = \ln 5 + \sum \lambda_x^y \ln\left(\lambda_x^y\right)$$

where $\lambda_x^y$ represents the frequency of each residue $x$ occurring at position $y$ and 5 represents the four possible residues as nucleotides plus gap. To identify the conserved regions, a minimum segment length of 15 is considered, with a maximum average entropy of 0.2 and a maximum entropy per position of 0.2 as well, without any gaps. All these values are taken from the literature. Thereafter, the conserved regions are refined on the criteria that their lengths are greater than or equal to 125nt and their BLAST specificity score as query coverage is equal to 100%. Subsequently, based on its length, a particular conserved region is considered as a potential target, which is then used to identify primers and probes using Primer-BLAST for SARS-CoV-2 detection. In the final phase of the pipeline, T-cell and B-cell epitopes along with their immunogenic and antigenic scores are predicted for the refined CnRs using IEDB and ABCPred respectively. For the MHC-I and MHC-II restricted T-cell epitopes, predictions are carried out using the IEDB recommended NetMHCPan EL 4.1 and Consensus Approach [21] respectively, while ABCPred [22] is used for B-cell epitope prediction. Thereafter, for these predicted epitopes, antigenic scores are evaluated by VaxiJen 2.0, while the identified T-cell epitopes are validated by studying their conformational 2D non-covalent structures using LigPlot+ [23]. For the verification of the predicted B-cell epitopes, the BepiPred 2.0 [24] server is used. Allergen and toxicity properties of the epitopes are evaluated using AllerTop 2.0 and ToxinPred respectively. The physico-chemical properties are also evaluated using ToxinPred.
Moreover, docking of all the T-cell epitopes is performed using AutoDock Vina [25] and their structural properties are reported using the Ramachandran plot [26], ERRAT score [27] and Verify 3D [28] through SAVES 6.0. Finally, Z-Score evaluation is performed using ProSA [29].
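The entropy-based scan for conserved regions described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the standard Shannon form (natural logarithm) over the five symbols A, C, G, T and gap, under which a fully conserved column scores 0 and the maximum is ln 5, and the function names `column_entropy` and `conserved_regions` are hypothetical. The thresholds (minimum length 15, maximum entropy 0.2, no gaps) are the ones stated in the text.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log(n / total) for n in Counter(column).values())


def conserved_regions(aligned, min_len=15, max_entropy=0.2):
    """Return 1-based inclusive (start, end) runs of gap-free, low-entropy columns.

    Every column in a run has entropy <= max_entropy, so the run's average
    entropy is automatically <= max_entropy as well.
    """
    regions, run_start = [], None
    n_cols = len(aligned[0])
    for i in range(n_cols + 1):          # one extra step closes a trailing run
        ok = False
        if i < n_cols:
            col = [seq[i] for seq in aligned]
            ok = "-" not in col and column_entropy(col) <= max_entropy
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if i - run_start >= min_len:
                regions.append((run_start + 1, i))
            run_start = None
    return regions


# Toy alignment (hypothetical data); min_len is lowered so the short example yields a hit.
toy = ["ACGTACGT", "ACGTACGA", "ACGTAC-T"]
print(conserved_regions(toy, min_len=3))
```

On the toy alignment the first six columns are identical and gap-free, the seventh contains a gap and the eighth is variable, so only columns 1 to 6 are reported.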

Results

Phylogenetic analysis and Signature SNPs in each clade

To achieve the first step of the bioinformatics pipeline, multiple sequence alignment of 4996 Indian SARS-CoV-2 genomes is performed using MAFFT followed by phylogenetic analysis with the help of Nextstrain. This phylogenetic analysis results in 5 clades viz. 19A, 19B, 20A, 20B and 20C. Thereafter, mutation points as substitutions specifically SNPs are identified in each clade resulting in 708, 161, 3308, 3235 and 47 SNPs for 479, 88, 2486, 1925 and 18 sequences respectively in 19A, 19B, 20A, 20B and 20C. The details of the SNPs are provided in the supplementary Table S1. The resultant phylogenetic trees in radial and rectangular views are shown in Fig. 2 (a) and (c) while the clade wise geographical distribution of the 4996 sequences is shown in Fig. 2(b). The clade wise evolution of the sequences for each month of each Indian state is shown in the form of pie charts in supplementary Table S2 while the month wise evolution of such sequences for each clade is reported in supplementary Table S3. The corresponding colour representation for the five major clades and the months are provided in supplementary Figure S1. Moreover, the entropy values for the nucleotide changes and coding regions of the SARS-CoV-2 genome are shown respectively in Fig. 2(d) and (e). It is to be noted that for some sequences, the state name is not mentioned in the GISAID database. Thus, they are aggregated under the state name ‘India’.
Fig. 2

(a) Phylogenetic Tree in Radial view (b) Geographical Distribution (c) Phylogenetic Tree in Rectangular view (d) Value of Entropy for the change in Nucleotide (e) Coding Regions of SARS-CoV-2 Genome (f) Signature SNPs (g) Venn Diagram of 5 clades and (h) Identification of Primers and Probes using Primer-BLAST.

Once the SNPs are determined for each clade, the top 10 SNPs based on their frequency, viz. signature SNPs, are identified in each clade, thereby resulting in 50 signature SNPs as reported in Table 1 and visualised in Fig. 2(f). In unsupervised learning, feature selection is a crucial task. In this work, the frequency of a SNP is considered to be the feature selection criterion. For example, G11083A and G11083T, with a combined frequency of 425, form the top signature SNP in clade 19A, while for 19B, T28144C with a frequency of 87 is the top signature SNP. Subsequently, 40 unique SNPs are identified, which results in 23 non-synonymous signature SNPs with 28 corresponding amino acid changes. The common signature SNPs in the five clades are visualised using a Venn diagram in Fig. 2(g). It is evident from the figure that no signature SNP is shared by all five clades, which supports the view that signature SNPs are defining features of a clade. Moreover, the amino acid changes are visualised in Fig. 3 as well. Please note that 27 amino acid changes are visualised in Fig. 3 as opposed to the 28 reported changes; the discarded change is E110* in ORF8, as this amino acid change leads to a stop codon. Also, sequence and structure-based homology predictions of the amino acid changes for the non-synonymous SNPs are reported in Table 2, the details of which are discussed in the Discussion section. All the detailed results are provided in supplementary Table S1.
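As a quick cross-check of the counts above, the genomic positions of the 50 signature SNPs listed in Table 1 can be pooled and de-duplicated. The sketch below only uses those positions; the variable names are illustrative and the snippet is not part of the authors' pipeline.

```python
# Genomic positions of the top 10 signature SNPs per clade, as listed in Table 1.
signature_snps = {
    "19A": [11083, 13730, 28311, 23929, 6312, 19524, 6310, 1397, 29742, 28688],
    "19B": [28144, 8782, 28878, 29742, 22468, 11230, 7945, 28167, 2705, 14500],
    "20A": [23403, 241, 3037, 14408, 26735, 18877, 25563, 28854, 22444, 2836],
    "20B": [3037, 241, 23403, 14408, 28881, 28882, 28883, 313, 5700, 4354],
    "20C": [241, 1059, 3037, 14408, 23403, 25563, 16260, 28821, 28221, 28371],
}

all_positions = [p for positions in signature_snps.values() for p in positions]
unique_positions = set(all_positions)

# Positions present in every clade (the centre of a five-set Venn diagram).
shared_by_all = set.intersection(*(set(p) for p in signature_snps.values()))

# Positions that recur in more than one clade, with the clades they occur in.
recurring = {
    pos: [clade for clade, positions in signature_snps.items() if pos in positions]
    for pos in sorted(unique_positions)
    if all_positions.count(pos) > 1
}

print(len(all_positions), len(unique_positions), shared_by_all)
print(recurring)
```

The pooled list has 50 entries and 40 distinct positions, and the intersection over all five clades is empty; the ten-entry difference comes from six positions that recur in two or three clades.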
Table 1

List of Signature SNPs in each clade for 4996 Indian SARS-CoV-2 Genomes.

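| Clade | Genomic Position | Frequency | Nucleotide Change | Protein Change | Protein Coordinate | Mapped with Coding and Non-Coding Region |
|---|---|---|---|---|---|---|
| 19A | 11083 | 425 | G>A, G>T | Synonymous, L>F | 37 | NSP6 |
| 19A | 13730 | 374 | C>T | A>V | 97 | RdRp |
| 19A | 28311 | 364 | C>T | P>L | 13 | Nucleocapsid |
| 19A | 23929 | 360 | C>T | Synonymous | 789 | Spike |
| 19A | 6312 | 359 | C>T, C>A | T>I, T>K | 1198 | NSP3 |
| 19A | 19524 | 111 | C>T | Synonymous | 495 | Exon |
| 19A | 6310 | 98 | C>A, C>T | S>R, Synonymous | 1197 | NSP3 |
| 19A | 1397 | 77 | G>A | V>I | 198 | NSP2 |
| 19A | 29742 | 77 | G>A, G>C, G>T | Not Present | Not Present | 3’ UTR |
| 19A | 28688 | 74 | T>C | Synonymous | 139 | Nucleocapsid |
| 19B | 28144 | 87 | T>C | L>S | 84 | ORF8 |
| 19B | 8782 | 86 | C>T | Synonymous | 76 | NSP4 |
| 19B | 28878 | 83 | G>A, G>T, G>C | S>N, S>I, S>T | 202 | Nucleocapsid |
| 19B | 29742 | 81 | G>A, G>C, G>T | Not Present | Not Present | 3’ UTR |
| 19B | 22468 | 62 | G>T, G>A | Synonymous, Synonymous | 302 | Spike |
| 19B | 11230 | 19 | G>T | M>I | 86 | NSP6 |
| 19B | 7945 | 16 | C>T | Synonymous | 1742 | NSP3 |
| 19B | 28167 | 15 | G>A | E>K | 92 | ORF8 |
| 19B | 2705 | 9 | A>G | T>A | 634 | NSP2 |
| 19B | 14500 | 9 | G>T | V>L | 354 | RdRp |
| 20A | 23403 | 2472 | A>G | D>G | 614 | Spike |
| 20A | 241 | 2458 | C>T | Not Present | Not Present | 5’ UTR |
| 20A | 3037 | 2455 | C>T | Synonymous | 106 | NSP3 |
| 20A | 14408 | 2377 | C>T | P>L | 323 | RdRp |
| 20A | 26735 | 1432 | C>T | Synonymous | 71 | Membrane |
| 20A | 18877 | 1427 | C>T | Synonymous | 280 | Exon |
| 20A | 25563 | 1418 | G>A, G>T, G>C | Synonymous, Q>H, Q>H | 57 | ORF3a |
| 20A | 28854 | 1230 | C>T | S>L | 194 | Nucleocapsid |
| 20A | 22444 | 1191 | C>T | Synonymous | 294 | Spike |
| 20A | 2836 | 557 | C>T | Synonymous | 39 | NSP3 |
| 20B | 3037 | 1923 | C>T | Synonymous | 106 | NSP3 |
| 20B | 241 | 1922 | C>T | Not Present | Not Present | 5’ UTR |
| 20B | 23403 | 1922 | A>G | D>G | 614 | Spike |
| 20B | 14408 | 1912 | C>T | P>L | 323 | RdRp |
| 20B | 28881 | 1868 | G>A, G>T | R>K, R>M | 203 | Nucleocapsid |
| 20B | 28882 | 1868 | G>A | Synonymous | 203 | Nucleocapsid |
| 20B | 28883 | 1867 | G>A, G>C | G>R, G>R | 204 | Nucleocapsid |
| 20B | 313 | 1120 | C>T | Synonymous | 16 | Leader protein |
| 20B | 5700 | 1106 | C>A | A>D | 994 | NSP3 |
| 20B | 4354 | 281 | G>A | Synonymous | 545 | NSP3 |
| 20C | 241 | 18 | C>T | Not Present | Not Present | 5’ UTR |
| 20C | 1059 | 18 | C>T | T>I | 85 | NSP2 |
| 20C | 3037 | 18 | C>T | Synonymous | 106 | NSP3 |
| 20C | 14408 | 18 | C>T | P>L | 323 | RdRp |
| 20C | 23403 | 18 | A>G | D>G | 614 | Spike |
| 20C | 25563 | 18 | G>A, G>T, G>C | Synonymous, Q>H, Q>H | 57 | ORF3a |
| 20C | 16260 | 9 | C>T | Synonymous | 8 | Helicase |
| 20C | 28821 | 9 | C>A | S>Y | 183 | Nucleocapsid |
| 20C | 28221 | 4 | G>T, G>C | E>-, E>Q | 110 | ORF8 |
| 20C | 28371 | 4 | G>T | S>I | 33 | Nucleocapsid |
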
Fig. 3

Highlighted amino acid changes in the protein structures for the non-synonymous signature SNPs of (a) NSP2 (b) NSP3 (c) NSP6 (d) RdRp (e) Spike (f) ORF3a (g) ORF8 and (h) Nucleocapsid.

Table 2

Sequence and structural homology-based prediction for non-synonymous signature SNPs along with their protein structural stability.

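| Clade | Genomic Coordinates | Amino residue Change | Protein | PROVEAN Effect | PROVEAN Score | PolyPhen-2 Prediction | PolyPhen-2 Score | I-Mutant 2.0 Stability | I-Mutant 2.0 DDG |
|---|---|---|---|---|---|---|---|---|---|
| 19A | 11083 | L37F | NSP6 | Neutral | -1.369 | Benign | 0.027 | Decrease | 0.05 |
| 19A | 13730 | A97V | RdRp | Deleterious | 3.611 | Probably Damaging | 0.99 | Decrease | 0.53 |
| 19A | 28311 | P13L | Nucleocapsid | Neutral | -1.23 | Probably Damaging | 1.000 | Increase | 0.11 |
| 19A | 6312 | T1198I | NSP3 | Neutral | -0.085 | Probably Damaging | 0.998 | Decrease | -0.72 |
| 19A | 6312 | T1198K | NSP3 | Neutral | -0.353 | NG | NG | Decrease | -1.37 |
| 19A | 6310 | S1197R | NSP3 | Neutral | -0.835 | NG | NG | Decrease | -0.88 |
| 19A | 1397 | V198I | NSP2 | Neutral | 0.307 | Benign | 0.006 | Increase | 0.18 |
| 19B | 28144 | L84S | ORF8 | Neutral | 2.333 | Benign | 0.002 | Decrease | -2.87 |
| 19B | 28878 | S202N | Nucleocapsid | Neutral | -0.404 | Probably Damaging | 0.994 | Decrease | -0.8 |
| 19B | 28878 | S202I | Nucleocapsid | Deleterious | -3.308 | Probably Damaging | 0.998 | Increase | 0.22 |
| 19B | 28878 | S202T | Nucleocapsid | Neutral | -1.428 | Probably Damaging | 0.986 | Decrease | -0.53 |
| 19B | 11230 | M86I | NSP6 | Neutral | -0.427 | Benign | 0.025 | Decrease | -1.02 |
| 19B | 28167 | E92K | ORF8 | Neutral | -1.5 | NG | NG | Decrease | -1.05 |
| 19B | 2705 | T634A | NSP2 | Neutral | -0.004 | Benign | 0.106 | Decrease | -1.13 |
| 19B | 14500 | V354L | RdRp | Deleterious | 2.581 | Probably Damaging | 0.997 | Decrease | 1.95 |
| 20A | 23403 | D614G | Spike | Neutral | 0.598 | Benign | 0.004 | Decrease | -1.94 |
| 20A | 14408 | P323L | RdRp | Neutral | -0.865 | Benign | 0.005 | Decrease | -0.80 |
| 20A | 25563 | Q57H | ORF3a | Deleterious | 3.286 | Probably Damaging | 0.966 | Decrease | 1.12 |
| 20A | 28854 | S194L | Nucleocapsid | Deleterious | -4.272 | Probably Damaging | 0.994 | Increase | 0.45 |
| 20B | 23403 | D614G | Spike | Neutral | 0.598 | Benign | 0.004 | Decrease | -1.94 |
| 20B | 14408 | P323L | RdRp | Neutral | -0.865 | Benign | 0.005 | Decrease | -0.80 |
| 20B | 28881 | R203K | Nucleocapsid | Neutral | -1.604 | Probably Damaging | 0.969 | Decrease | -2.26 |
| 20B | 28881 | R203M | Nucleocapsid | Deleterious | 3.305 | Probably Damaging | 0.998 | Decrease | 1.52 |
| 20B | 28883 | G204R | Nucleocapsid | Neutral | -1.656 | Probably Damaging | 1 | Decrease | 0 |
| 20B | 5700 | A994D | NSP3 | Neutral | -1.103 | NG | NG | Decrease | -0.78 |
| 20C | 1059 | T85I | NSP2 | Deleterious | 4.09 | Probably Damaging | 0.998 | Decrease | 1.71 |
| 20C | 14408 | P323L | RdRp | Neutral | -0.865 | Benign | 0.005 | Decrease | -0.80 |
| 20C | 23403 | D614G | Spike | Neutral | 0.598 | Benign | 0.004 | Decrease | -1.94 |
| 20C | 25563 | Q57H | ORF3a | Deleterious | 3.286 | Probably Damaging | 0.966 | Decrease | 1.12 |
| 20C | 28821 | S183Y | Nucleocapsid | Deleterious | -2.75 | Probably Damaging | 0.998 | Increase | 0 |
| 20C | 28221 | E110Q | ORF8 | Neutral | -0.25 | NG | NG | Decrease | -1.13 |
| 20C | 28371 | S33I | Nucleocapsid | Neutral | -1.372 | NG | NG | Increase | 0.63 |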

Selection of CnRs

For the next phase of this study, we have obtained 473 conserved regions (CnRs), which are then mapped to the 11 coding regions of SARS-CoV-2: ORF1ab, Spike, ORF3a, Envelope, Membrane, ORF6, ORF7a, ORF7b, ORF8, Nucleocapsid and ORF10. For each CnR, the corresponding protein sequence is taken according to the reading frame it is associated with. For example, the protein sequence of a CnR in the Spike region is taken from Frame 2, while those belonging to Envelope and Membrane are taken from Frames 1 and 3 respectively. These 473 conserved regions are then filtered on the criteria that the length of the CnR should be greater than or equal to 125nt and that its BLAST specificity score as query coverage is equal to 100%. As a result, we have obtained 12 such regions, as reported in Table 3. The table also shows the corresponding protein sequences for the conserved regions along with their length, BLAST specificity score, percent of BLAST specificity score as query coverage, coding regions, starting and ending coordinates, length of coding regions and the coded proteins. These CnRs belong to coding regions which code for NSP2, NSP8, NSP10, RdRp, Exon, Spike glycoprotein, ORF3a protein and ORF7a protein. The details of all the initial and filtered CnRs are provided in the supplementary material as an Excel file. Based on its length, one of these CnRs is then chosen as the target for the detection of SARS-CoV-2. Moreover, the protein sequences of these CnRs are used to identify the MHC-I and MHC-II restricted T-cell and B-cell epitopes.
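The frame assignment and the refinement filter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the helper names `reading_frame` and `refine` are hypothetical, the frame is assumed to be determined by the 1-based start coordinate of the coding region on the genome, and the Envelope and Membrane start coordinates (26245 and 26523) are taken from the NC_045512.2 annotation rather than from the text. The Spike start (21563) and the two real CnR records come from Table 3; the other two records are made up to exercise the filter.

```python
def reading_frame(cds_start):
    """Genome reading frame (1, 2 or 3) implied by a 1-based coding-region start."""
    return (cds_start - 1) % 3 + 1


def refine(cnrs, min_len=125, required_coverage=100):
    """Keep CnRs that are long enough and have full BLAST query coverage."""
    return [
        c for c in cnrs
        if c["length"] >= min_len and c["coverage"] == required_coverage
    ]


candidates = [
    {"start": 13125, "length": 247, "coverage": 100},  # NSP10 CnR (Table 3)
    {"start": 27394, "length": 127, "coverage": 100},  # ORF7a CnR (Table 3)
    {"start": 100, "length": 60, "coverage": 100},     # hypothetical: too short
    {"start": 200, "length": 130, "coverage": 98},     # hypothetical: coverage < 100%
]

print([reading_frame(s) for s in (21563, 26245, 26523)])  # Spike, Envelope, Membrane
print([c["start"] for c in refine(candidates)])
```

With these coordinates the assumed convention reproduces the frames quoted in the text (Spike in Frame 2, Envelope in Frame 1, Membrane in Frame 3), and only the two Table 3 records pass the filter.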
Table 3

Conserved Regions (CnRs) as derived from 4996 SARS-CoV-2 genomes with associated details

| DNA Sequence of Conserved Region (CnR) | Protein Sequence | Length of CnR | BLAST Specificity Score of CnR | % of BLAST Specificity Score as Query Coverage | Coding Region (CR) | Starting Coordinate | Ending Coordinate | Length of Coding Region | Coded Proteins |
|---|---|---|---|---|---|---|---|---|---|
| 1282-CACTTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTTGTTAAAATTTATTGTCCAGCATGTCACAATTCAGAAGTAGGACCTGAGCATAGTCTTG-1418 | TCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSL | 137 | 254 | 100 | ORF1ab | 266 | 21552 | 21287 | NSP2 |
| 12422-AGAGATGGTTGTGTTCCCTTGAACATAATACCTCTTACAACAGCAGCCAAACTAATGGTTGTCATACCAGACTATAACACATATAAAAATACGTGTGATGGTACAACATTTACTTATGCATCAGCATTGTGGGAAAT-12558 | RDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWE | 137 | 254 | 100 | ORF1ab | 266 | 21552 | 21287 | NSP8 |
| 13125-GGGGACAACCAATCACTAATTGTGTTAAGATGTTGTGTACACACACTGGTACTGGTCAGGCAATAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACACTTAAAAACACAGT-13371 | GQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNT | 247 | 457 | 100 | ORF1ab | 266 | 21555 | 21290 | NSP10 |
| 14075-TCAATGGTAACTGGTATGATTTCGGTGATTTCATACAAACCACGCCAGGTAGTGGAGTTCCTGTTGTAGATTCTTATTATTCATTGTTAATGCCTATATTAACCTTGACCAGGGCTTTAACTGCAGAGTCAC-14206 | NGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAES | 132 | 244 | 100 | ORF1ab | 266 | 21552 | 21287 | RdRp |
| 14221-TTAACAAAGCCTTACATTAAGTGGGATTTGTTAAAATATGACTTCACGGAAGAGAGGTTAAAACTCTTTGACCGTTATTTTAAATATTGGGATCAGACATACCACCCAAATTGTGTTAACTGTTTGGATGACAGATGCATTCTGCATTGTGCAAACTTTAATGTTTTATTCTCTACAGTGTTCCCA-14406 | LTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFP | 186 | 344 | 100 | ORF1ab | 266 | 21552 | 21287 | RdRp |
| 15607-TTACAACACAGACTTTATGAGTGTCTCTATAGAAATAGAGATGTTGACACAGACTTTGTGAATGAGTTTTACGCATATTTGCGTAAACATTTCTCAATGATGATACTCTCTGACGATGCTGTTGTGTGTTT-15737 | LQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVC | 131 | 243 | 100 | ORF1ab | 266 | 21552 | 21287 | RdRp |
| 15991-GATGGTACACTTATGATTGAACGGTTCGTGTCTTTAGCTATAGATGCTTACCCACTTACTAAACATCCTAATCAGGAGTATGCTGATGTCTTTCATTTGTACTTACAATACATAAGAAAGCTACATGATGAGTTAACAGGACACATGTTAGACATGTATTCTGTTATGCTTACTAATGATAACACTTCAAGGTATTGGGAACCTGAGTTTTATGA-16205 | DGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFY | 215 | 398 | 100 | ORF1ab | 266 | 21552 | 21287 | RdRp |
| 18487-ATACCACTTATGTACAAAGGACTTCCTTGGAATGTAGTGCGTATAAAGATTGTACAAATGTTAAGTGACACACTTAAAAATCTCTCTGACAGAGTCGTATTTGTCTTATGGGCACATGGCTTTGAGTTGACATCTATGAAGTATTTTGTGAAAATAGGACCTGAGCGCACCTGTTGTCTATGT-18669 | IPLMYKGLPWNVVRIKIVQMLSDTLKNLSDRVVFVLWAHGFELTSMKYFVKIGPERTCCLC | 183 | 339 | 100 | ORF1ab | 266 | 21552 | 21287 | Exon |
| 18980-ACATGGTTGTTAAAGCTGCATTATTAGCAGACAAATTCCCAGTTCTTCACGACATTGGTAACCCTAAAGCTATTAAGTGTGTACCTCAAGCTGATGTAGAATGGAAGTTCTATGATGCACAGCCTTGTAGTGACAAAGCTTATAAAATAGAAG-19132 | MVVKAALLADKFPVLHDIGNPKAIKCVPQADVEWKFYDAQPCSDKAYKIE | 153 | 283 | 100 | ORF1ab | 266 | 21552 | 21287 | Exon |
| 24490-TTTAAATGATATCCTTTCACGTCTTGACAAAGTTGAGGCTGAAGTGCAAATTGATAGGTTGATCACAGGCAGACTTCAAAGTTTGCAGACATATGTGACTCAACAATTAATTAGAGCTGCAGAAATCAGAGC-24621 | LNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIR | 132 | 244 | 100 | Spike | 21563 | 25381 | 3819 | Spike glycoprotein |
| 25913-GCACAACAAGTCCTATTTCTGAACATGACTACCAGATTGGTGGTTATACTGAAAAATGGGAATCTGGAGTAAAAGACTGTGTTGTATTACACAGTTACTTCACTTCAGACTATTACCAGCTGTACTCAACTCAATTGAGTACAGACACT-26061 | TTSPISEHDYQIGGYTEKWESGVKDCVVLHSYFTSDYYQLYSTQLSTDT | 149 | 276 | 100 | ORF3a | 25393 | 26217 | 825 | ORF3a protein |
| 27394-ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCA-27520 | MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEG | 127 | 235 | 100 | ORF7a | 27394 | 27756 | 363 | ORF7a protein |

Identification of Conserved Region as Target and associated Primers and Probes

Among the 12 CnRs identified, the longest CnR, with a length of 247nt, is considered to be a potential target. This CnR belongs to the ORF1ab region, specifically the NSP10 gene, as shown in Table 4. With a Nucleotide BLAST score of 457 and a BLAST specificity score as query coverage equal to 100%, the stability of this CnR as a global target is confirmed. The structure of the NSP10 protein shown in Table 4 is taken from ZhangLab in the form of a PDB file, and the CnR as target is highlighted in red. Using this conserved region, 10 primers and probes are identified with Primer-BLAST and are reported in Table 5 and shown in Fig. 2(h). The table reports both the forward and the reverse primers. Moreover, the GC content of the identified primers (45%-53%) suggests that these primers and probes can be used in RT-PCR for SARS-CoV-2 detection in order to correctly diagnose COVID-19 patients. Therefore, the target region of the NSP10 gene can be considered for a confirmatory assay. It is to be noted that, based on its adhesive properties, Ong et al. [30] have predicted NSP10 as a possible vaccine candidate.
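The GC percentages and probe lengths in Table 5 can be sanity-checked with a short Python sketch. The helper names `gc_percent` and `reverse_complement` are hypothetical; the sequences and coordinates are those of primer pair 2 in Table 5. The check assumes that the probe spans the amplicon from the first base of the forward primer to the last base of the reverse primer's binding site, which is an inference from the table rather than something the text states.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    return 100 * sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Primer pair 2 from Table 5 (positions are relative to the 247nt NSP10 CnR).
forward = "TAACAGTTACACCGGAAGCC"          # 64-83
reverse = "TCTATGTGGCAACGGCAGTA"          # 145-126
probe = (
    "TAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTG"
    "TTGTCTGTACTGCCGTTGCCACATAGA"
)

print(gc_percent(forward), gc_percent(reverse))       # GC% of both primers
print(probe.startswith(forward))                      # probe begins with the forward primer
print(probe.endswith(reverse_complement(reverse)))    # and ends at the reverse primer site
print(len(probe), 145 - 64 + 1)                       # probe length vs. coordinate span
```

For this pair both primers have 50% GC, and the probe length of 82 matches the span between coordinates 64 and 145, in agreement with the values reported in the table.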
Table 4

Targeted Conserved Region in SARS-CoV-2 Genome and its corresponding protein sequence in NSP10 which is highlighted by red colour in NSP10 gene.

DNA Sequence of Conserved Region (CnR) | Protein Sequence | NSP10 protein structure with target region
13125-GGGGACAACCAATCACTAATTGTGTTAAGATGTTGTGTACACACACTGGTACTGGTCAGGCAATAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACACTTAAAAACACAGT-13371 | 35-GQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNT-115 | [structure image]
Table 5

Details of Primers and Probes of NSP10 gene.

Primer Pair | Type | Sequence (5’->3’) | Length | Tm | GC% | Probe Sequence | Probe Length
1 | Forward | 117-TGTTGTCTGTACTGCCGTTG-136 | 20 | 60.05 | 50 | TGTTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTT | 113
1 | Reverse | 229-AAACCCACAGGGTCATTAGC-210 | 20 | 59.46 | 50 | |
2 | Forward | 64-TAACAGTTACACCGGAAGCC-83 | 20 | 59.18 | 50 | TAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGA | 82
2 | Reverse | 145-TCTATGTGGCAACGGCAGTA-126 | 20 | 60.76 | 50 | |
3 | Forward | 95-AGAATCCTTTGGTGGTGCAT-114 | 20 | 59.08 | 45 | AGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTT | 136
3 | Reverse | 230-AAAACCCACAGGGTCATTAGC-210 | 21 | 60.16 | 47.62 | |
4 | Forward | 35-GTGTACACACACTGGTACTGG-55 | 21 | 59.89 | 52.38 | GTGTACACACACTGGTACTGGTCAGGCAATAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTT | 86
4 | Reverse | 120-AACACGATGCACCACCAAAG-101 | 20 | 60.97 | 50 | |
5 | Forward | 45-ACTGGTACTGGTCAGGCAATA-65 | 21 | 60.16 | 47.62 | ACTGGTACTGGTCAGGCAATAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTG | 81
5 | Reverse | 125-CAGACAACACGATGCACCA-107 | 19 | 60 | 52.63 | |
6 | Forward | 101-CTTTGGTGGTGCATCGTGTT-120 | 20 | 60.97 | 50 | CTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACAC | 134
6 | Reverse | 234-GTGTAAAACCCACAGGGTCAT-214 | 21 | 59.81 | 47.62 | |
7 | Forward | 119-TTGTCTGTACTGCCGTTGC-137 | 19 | 60 | 52.63 | TTGTCTGTACTGCCGTTGCCACATAGATCATCCAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCTACAACTTGTGCTAATGACCCTGTGGGTTTTACACTT | 118
7 | Reverse | 236-AAGTGTAAAACCCACAGGGTC-216 | 21 | 59.74 | 47.62 | |
8 | Forward | 66-ACAGTTACACCGGAAGCCAA-85 | 20 | 61.2 | 50 | ACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATCCA | 87
8 | Reverse | 152-TGGATGATCTATGTGGCAACG-132 | 21 | 59.81 | 47.62 | |
9 | Forward | 44-CACTGGTACTGGTCAGGCAA-63 | 20 | 61.27 | 55 | CACTGGTACTGGTCAGGCAATAACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTT | 77
9 | Reverse | 120-AACACGATGCACCACCAAA-102 | 19 | 59.84 | 47.37 | |
10 | Forward | 65-AACAGTTACACCGGAAGCCA-84 | 20 | 61.2 | 50 | AACAGTTACACCGGAAGCCAATATGGATCAAGAATCCTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATA | 79
10 | Reverse | 143-TATGTGGCAACGGCAGTACA-124 | 20 | 61.34 | 50 | |
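The primer coordinates and GC percentages in Table 5 can be cross-checked against the CnR sequence of Table 4 with a few lines of Python. The following is only a minimal sketch: the helper names (`gc_percent`, `reverse_complement`, `locate`) are illustrative and not part of the original pipeline, and the check is shown for primer pair 1 only.

```python
# Illustrative check of primer pair 1 (Table 5) against the 247 nt CnR (Table 4).
# Helper names are hypothetical; they are not taken from the original pipeline.
CNR = (
    "GGGGACAACCAATCACTAATTGTGTTAAGATGTTGTGTACACACACTGGT"
    "ACTGGTCAGGCAATAACAGTTACACCGGAAGCCAATATGGATCAAGAATC"
    "CTTTGGTGGTGCATCGTGTTGTCTGTACTGCCGTTGCCACATAGATCATC"
    "CAAATCCTAAAGGATTTTGTGACTTAAAAGGTAAGTATGTACAAATACCT"
    "ACAACTTGTGCTAATGACCCTGTGGGTTTTACACTTAAAAACACAGT"
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(seq):
    """Percentage of G and C bases, rounded to two decimals."""
    seq = seq.upper()
    return round(100.0 * (seq.count("G") + seq.count("C")) / len(seq), 2)


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def locate(fragment, template):
    """1-based inclusive (start, end) of fragment on the template, or None."""
    index = template.find(fragment)
    return None if index < 0 else (index + 1, index + len(fragment))


forward = "TGTTGTCTGTACTGCCGTTG"   # pair 1, forward primer
reverse = "AAACCCACAGGGTCATTAGC"   # pair 1, reverse primer

fwd_pos = locate(forward, CNR)
# The reverse primer anneals to the opposite strand, so search for its
# reverse complement on the strand given in Table 4.
rev_pos = locate(reverse_complement(reverse), CNR)
amplicon_length = rev_pos[1] - fwd_pos[0] + 1

print(fwd_pos, rev_pos, amplicon_length, gc_percent(forward), gc_percent(reverse))
```

Under these assumptions the span from the start of the forward primer to the end of the reverse primer equals the probe length listed for pair 1.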

Identification of T-cell Epitopes

To achieve the final phase of the pipeline, the design of an epitope-based synthetic vaccine is carried out. To predict epitopes from the 12 CnRs, the corresponding protein sequences are provided as inputs to the various tools. For the prediction of MHC-I restricted T-cell epitopes, the IEDB-recommended NetMHCpan EL 4.1 [31] is used, targeting 27 unique HLA alleles. For each CnR, this results in the selection of the 5 best HLA allele binder epitopes based on their immunogenic scores. Thereafter, these best binders are provided as input to the VaxiJen 2.0 [32] server for antigenic score prediction [31] with a cut-off score of 0.4. Any epitope scoring above this cut-off is considered to be antigenic. Thus, a total of 60 epitopes, each 9–10 residues long, are obtained along with their immunogenic and antigenic scores. From each of the 12 CnRs, the most immunogenic and the most antigenic MHC-I restricted T-cell epitopes are identified, resulting in 22 such epitopes, which are reported in Table 6. With a score of 0.99, the most immunogenic epitopes are SEVGPEHSL, DTDFVNEFY and QEYADVFHLY, bound to the HLA-B*40:01, HLA-A*01:01 and HLA-B*44:03 alleles respectively and belonging to the NSP2 and RdRp coded proteins. On the other hand, with a score of 1.43, HPNPKGFCDL is the most antigenic epitope; it belongs to the NSP10 coded protein and is bound to the HLA-B*07:02 allele.
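The per-CnR selection described above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: it assumes that "most immunogenic" and "most antigenic" mean the extreme score within each CnR, the function and variable names are hypothetical, and only four MHC-I rows from Table 6 are used as sample input.

```python
from collections import defaultdict

# (coded protein of the CnR, epitope, HLA allele, immunogenic score, antigenic score)
# Sample MHC-I rows copied from Table 6; a full run would hold 5 candidates per CnR.
CANDIDATES = [
    ("NSP2", "SEVGPEHSL", "HLA-B*40:01", 0.99, 0.72),
    ("NSP2", "NSEVGPEHSL", "HLA-B*40:01", 0.79, 0.82),
    ("NSP10", "DLKGKYVQI", "HLA-B*08:01", 0.92, 1.38),
    ("NSP10", "HPNPKGFCDL", "HLA-B*07:02", 0.69, 1.43),
]
ANTIGENIC_CUTOFF = 0.4  # VaxiJen 2.0 cut-off stated in the text


def is_antigenic(score, cutoff=ANTIGENIC_CUTOFF):
    """An epitope counts as antigenic when its score exceeds the cut-off."""
    return score > cutoff


def select_epitopes(candidates, lower_immunogenic_is_better=False):
    """Pick the most immunogenic and most antigenic epitope per region.

    Assumption: the selection is a simple arg-max (or arg-min for scores
    where lower is better) within each conserved region.
    """
    by_region = defaultdict(list)
    for row in candidates:
        by_region[row[0]].append(row)
    pick = min if lower_immunogenic_is_better else max
    return {
        region: {
            "immunogenic": pick(rows, key=lambda r: r[3])[1],
            "antigenic": max(rows, key=lambda r: r[4])[1],
        }
        for region, rows in by_region.items()
    }


selected = select_epitopes(CANDIDATES)
print(selected)
```

The `lower_immunogenic_is_better` flag is included only because some prediction methods report scores for which smaller values indicate stronger binders; whether it applies depends on the tool used.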
Table 6

List of most Immunogenic and Antigenic Epitopes for MHC-I, MHC-II restricted T-cell and B-cell Epitopes for 12 CnRs. *I.S.-Immunogenic Score; A.S.-Antigenic Score.

Protein Sequence | Coded Protein | Type | MHC-I Epitope | MHC-I Allele | MHC-I I.S.* | MHC-I A.S.* | MHC-II Epitope | MHC-II Allele | MHC-II I.S.* | MHC-II A.S.* | B-cell Epitope | B-cell I.S.* | B-cell A.S.*
160-TCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSL-204 | NSP2 | Immunogenic | SEVGPEHSL | HLA-B*40:01 | 0.99 | 0.72 | TTCGYLPQNAVVKIY | HLA-DRB5*01:01 | 4.30 | 0.04 | VVKIYCPACHNSEVGP | 0.96 | 0.66
 | | Antigenic | NSEVGPEHSL | HLA-B*40:01 | 0.79 | 0.82 | ATTCGYLPQNAVVKI | HLA-DRB5*01:01 | 5.20 | 0.18 | | |
111-RDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWE-155 | NSP8 | Immunogenic | NTCDGTTFTY | HLA-A*01:01 | 0.97 | -0.03 | VPLNIIPLTTAAKLM | HLA-DRB1*08:02 | 0.25 | 0.88 | MVVIPDYNTYKNTCDG | 0.94 | 0.24
 | | Antigenic | TTFTYASALW | HLA-B*57:01 | 0.95 | 0.40 | GCVPLNIIPLTTAAK | HLA-DRB1*08:02 | 0.27 | 1.13 | VPLNIIPLTTAAKLMV | 0.57 | 0.74
35-GQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNT-115 | NSP10 | Immunogenic | DLKGKYVQI | HLA-B*08:01 | 0.92 | 1.38 | LKGKYVQIPTTCAND | HLA-DRB1*04:01 | 0.49 | 0.63 | RCHIDHPNPKGFCDLK | 0.93 | 0.72
 | | Antigenic | HPNPKGFCDL | HLA-B*07:02 | 0.69 | 1.43 | DLKGKYVQIPTTCAN | HLA-DRB1*04:01 | 0.51 | 0.86 | PNPKGFCDLKGKYVQI | 0.66 | 1.55
213-NGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAES-255 | RdRp | Immunogenic | SLLMPILTL | HLA-A*02:01 | 0.79 | 0.21 | SYYSLLMPILTLTRA | HLA-DRB1*01:01 | 0.16 | 0.55 | DFIQTTPGSGVPVVDS | 0.93 | 0.36
 | | Antigenic | SGVPVVDSY | HLA-B*35:01 | 0.66 | 0.59 | | | | | VDSYYSLLMPILTLTR | 0.62 | 0.47
261-LTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFP-322 | RdRp | Immunogenic | KLFDRYFKY | HLA-A*32:01 | 0.95 | -0.05 | TEERLKLFDRYFKYW | HLA-DPA1*01:03/DPB1*02:01 | 0.76 | 0.18 | YFKYWDQTYHPNCVNC | 0.88 | 0.75
 | | Antigenic | | | | | RLKLFDRYFKYWDQT | HLA-DPA1*01:03/DPB1*02:01 | 1.20 | 0.44 | | |
723-LQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVC-765 | RdRp | Immunogenic | DTDFVNEFY | HLA-A*01:01 | 0.99 | 0.25 | NEFYAYLRKHFSMMI | HLA-DRB1*11:01 | 0.02 | 0.23 | HRLYECLYRNRDVDTD | 0.83 | 0.23
 | | Antigenic | YLRKHFSMM | HLA-B*08:01 | 0.88 | 0.49 | EFYAYLRKHFSMMIL | HLA-DRB1*11:01 | 0.05 | 0.39 | | |
851-DGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFY-921 | RdRp | Immunogenic | QEYADVFHLY | HLA-B*44:03 | 0.99 | 0.27 | VFHLYLQYIRKLHDE | HLA-DRB4*01:01 | 0.37 | 0.28 | GHMLDMYSVMLTNDNT | 0.91 | 0.43
 | | Antigenic | QEYADVFHL | HLA-B*40:01 | 0.98 | 0.36 | HMLDMYSVMLTNDNT | HLA-DRB1*04:05 | 0.42 | 0.55 | HPNQEYADVFHLYLQY | 0.77 | 0.55
150-IPLMYKGLPWNVVRIKIVQMLSDTLKNLSDRVVFVLWAHGFELTSMKYFVKIGPERTCCLC-210 | Exon | Immunogenic | NLSDRVVFV | HLA-A*02:03 | 0.94 | 0.95 | VRIKIVQMLSDTLKN | HLA-DRB4*01:01 | 0.38 | 0.29 | GFELTSMKYFVKIGPE | 0.87 | 1.17
 | | Antigenic | | | | | PWNVVRIKIVQMLSD | HLA-DRB4*01:01 | 0.41 | 0.46 | | |
315-MVVKAALLADKFPVLHDIGNPKAIKCVPQADVEWKFYDAQPCSDKAYKIE-364 | Exon | Immunogenic | LLADKFPVL | HLA-A*02:01 | 0.94 | 0.08 | MVVKAALLADKFPVL | HLA-DPA1*01:03/DPB1*02:01 | 1.30 | 0.40 | KCVPQADVEWKFYDAQ | 0.80 | 1.34
 | | Antigenic | KCVPQADVEW | HLA-B*57:01 | 0.90 | 1.09 | | | | | | |
977-LNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIR-1019 | Spike glycoprotein | Immunogenic | AEVQIDRLI | HLA-B*44:03 | 0.90 | -0.56 | VEAEVQIDRLITGRL | HLA-DRB1*03:01 | 1.10 | -0.37 | DRLITGRLQSLQTYVT | 0.77 | -0.36
 | | Antigenic | RLDKVEAEV | HLA-A*02:01 | 0.83 | 0.08 | LQTYVTQQLIRAAEI | HLA-DRB4*01:01 | 2.70 | 0.02 | LNDILSRLDKVEAEVQ | 0.51 | 0.17
175-TTSPISEHDYQIGGYTEKWESGVKDCVVLHSYFTSDYYQLYSTQLSTDT-223 | ORF3a protein | Immunogenic | FTSDYYQLY | HLA-A*01:01 | 0.98 | -0.11 | VLHSYFTSDYYQLYS | HLA-DPA1*01:03/DPB1*04:01 | 0.17 | 0.06 | TSPISEHDYQIGGYTE | 0.93 | 0.72
 | | Antigenic | SEHDYQIGGY | HLA-B*44:03 | 0.91 | 1.04 | HSYFTSDYYQLYSTQ | HLA-DPA1*01:03/DPB1*04:01 | 0.33 | 0.25 | | |
1-MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEG-42 | ORF7a | Immunogenic | QECVRGTTVL | HLA-B*40:01 | 0.83 | 0.60 | ILFLALITLATCELY | HLA-DRB1*01:01 | 0.16 | 0.19 | TCELYHYQECVRGTTV | 0.81 | 0.53
 | | Antigenic | ILFLALITL | HLA-A*02:01 | 0.45 | 0.82 | | | | | | |
Similarly, MHC-II restricted T-cell epitopes are predicted using the IEDB-recommended consensus approach, targeting a different set of 27 unique HLA alleles and resulting in 60 epitopes, each 15 residues long. Subsequently, the most immunogenic and the most antigenic MHC-II restricted T-cell epitopes are identified for the 12 CnRs, which results in 21 such epitopes as reported in Table 6. Note that, for MHC-II restricted T-cell epitopes, a lower immunogenic score indicates a better vaccine candidate. Thus, with a score of 0.02, NEFYAYLRKHFSMMI, belonging to the RdRp coded protein and bound to the HLA-DRB1*11:01 allele, is the most immunogenic epitope, while the most antigenic epitope is GCVPLNIIPLTTAAK, belonging to the NSP8 coded protein and bound to the HLA-DRB1*08:02 allele. All 60 MHC-I and 60 MHC-II restricted T-cell epitopes, along with their HLA alleles, are provided in the supplementary material as an Excel file, and the corresponding link is provided in Table S1.

Identification of B-cell Epitopes

Epitope design covers both T-cell and B-cell epitopes; the latter are particularly important for antibody production against a virus. In this regard, ABCPred is used for the prediction of B-cell epitopes, with a threshold of 0.5 above which epitopes are considered to be immunogenic. The VaxiJen 2.0 server, with a cut-off value of 0.4, is used to evaluate the antigenic scores of the epitopes. Thus, we have identified 50 linear B-cell epitopes, each 16 residues long, for the 12 CnRs, among which 17 are selected as the most immunogenic and antigenic, as shown in Table 6. These epitopes are also verified with the help of the BepiPred 2.0 server, and the corresponding graphical analysis is shown in supplementary Figure S2, where the red line represents the threshold, which is set to 0.35, and the green and yellow regions together indicate a protein sequence. The most immunogenic and the most antigenic B-cell epitopes reported in Table 6 are, respectively, VVKIYCPACHNSEVGP, belonging to the NSP2 coded protein, and PNPKGFCDLKGKYVQI, belonging to the NSP10 coded protein. Their graphical representations are provided in supplementary Figure S2 (a) and (c) respectively. All 50 B-cell epitopes are provided in the supplementary material as an Excel file, and the corresponding link is provided in Table S1. Additionally, Table 7 provides a summarised list of all the epitopes belonging to these 12 CnRs along with their allergenicity and toxicity characteristics, predicted using AllerTOP 2.0 and ToxinPred. Here, 12 MHC-I restricted T-cell, 6 MHC-II restricted T-cell and 8 B-cell epitopes are identified as allergenic, while only 1 MHC-I restricted T-cell epitope and 5 B-cell epitopes are found to be toxic. The 3D structures of the epitopes summarised in Table 7 are further highlighted in Fig. 4 using ChimeraX. For ease of understanding, the identified epitopes are underlined in supplementary Figure S3.
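To make the 16-residue window and the two thresholds concrete, the sketch below enumerates every 16-mer of the NSP2 CnR protein sequence from Table 6 and applies the stated cut-offs. It rests on assumptions: that candidate B-cell epitopes correspond to overlapping fixed-length windows of the CnR protein sequence, and that an epitope is retained when both scores exceed their thresholds. The function names are hypothetical, and the scores themselves would come from the ABCPred and VaxiJen servers, which are not called here.

```python
CNR_START = 160  # protein coordinate of the first residue (Table 6)
CNR_PROTEIN = "TCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSL"
WINDOW = 16      # B-cell epitope length stated in the text


def windows(seq, size, offset):
    """Yield (start, end, peptide) in protein coordinates for each window."""
    for i in range(len(seq) - size + 1):
        yield offset + i, offset + i + size - 1, seq[i:i + size]


def passes_thresholds(immunogenic, antigenic, imm_threshold=0.5, ant_threshold=0.4):
    """True when both scores exceed the thresholds quoted in the text."""
    return immunogenic > imm_threshold and antigenic > ant_threshold


all_windows = list(windows(CNR_PROTEIN, WINDOW, CNR_START))
# Locate the epitope reported as most immunogenic for this CnR.
hit = next(w for w in all_windows if w[2] == "VVKIYCPACHNSEVGP")
print(len(all_windows), hit, passes_thresholds(0.96, 0.66))
```

With the scores listed in Table 6 for VVKIYCPACHNSEVGP (I.S. 0.96, A.S. 0.66), the epitope passes both thresholds under this reading.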
Table 7

Summary of the most Immunogenic and Antigenic Epitopes along with the Allergic and Toxicity values.

Coded Protein | MHC-I restricted T-cell Epitope | Allergic | Toxicity | MHC-II restricted T-cell Epitope | Allergic | Toxicity | Linear B-cell Epitope | Allergic | Toxicity
NSP2 | SEVGPEHSL | Non-Allergen | Non-Toxin | TTCGYLPQNAVVKIY | Non-Allergen | Non-Toxin | VVKIYCPACHNSEVGP | Allergen | Non-Toxin
 | NSEVGPEHSL | Allergen | Non-Toxin | ATTCGYLPQNAVVKI | Non-Allergen | Non-Toxin | | |
NSP8 | NTCDGTTFTY | Allergen | Non-Toxin | VPLNIIPLTTAAKLM | Non-Allergen | Non-Toxin | MVVIPDYNTYKNTCDG | Non-Allergen | Non-Toxin
 | TTFTYASALW | Allergen | Non-Toxin | GCVPLNIIPLTTAAK | Non-Allergen | Non-Toxin | VPLNIIPLTTAAKLMV | Non-Allergen | Non-Toxin
NSP10 | DLKGKYVQI | Allergen | Non-Toxin | LKGKYVQIPTTCAND | Allergen | Non-Toxin | RCHIDHPNPKGFCDLK | Allergen | Toxin
 | HPNPKGFCDL | Allergen | Toxin | DLKGKYVQIPTTCAN | Allergen | Non-Toxin | PNPKGFCDLKGKYVQI | Allergen | Non-Toxin
RdRp | SLLMPILTL | Non-Allergen | Non-Toxin | SYYSLLMPILTLTRA | Non-Allergen | Non-Toxin | DFIQTTPGSGVPVVDS | Non-Allergen | Non-Toxin
 | SGVPVVDSY | Allergen | Non-Toxin | | | | VDSYYSLLMPILTLTR | Allergen | Non-Toxin
RdRp | KLFDRYFKY | Non-Allergen | Non-Toxin | TEERLKLFDRYFKYW | Allergen | Non-Toxin | YFKYWDQTYHPNCVNC | Non-Allergen | Toxin
 | | | | RLKLFDRYFKYWDQT | Allergen | Non-Toxin | | |
RdRp | DTDFVNEFY | Allergen | Non-Toxin | NEFYAYLRKHFSMMI | Non-Allergen | Non-Toxin | HRLYECLYRNRDVDTD | Non-Allergen | Toxin
 | YLRKHFSMM | Non-Allergen | Non-Toxin | EFYAYLRKHFSMMIL | Non-Allergen | Non-Toxin | | |
RdRp | QEYADVFHLY | Allergen | Non-Toxin | VFHLYLQYIRKLHDE | Non-Allergen | Non-Toxin | GHMLDMYSVMLTNDNT | Allergen | Non-Toxin
 | QEYADVFHL | Allergen | Non-Toxin | HMLDMYSVMLTNDNT | Allergen | Non-Toxin | HPNQEYADVFHLYLQY | Non-Allergen | Toxin
Exon | NLSDRVVFV | Non-Allergen | Non-Toxin | VRIKIVQMLSDTLKN | Non-Allergen | Non-Toxin | GFELTSMKYFVKIGPE | Non-Allergen | Non-Toxin
 | | | | PWNVVRIKIVQMLSD | Non-Allergen | Non-Toxin | | |
Exon | LLADKFPVL | Allergen | Non-Toxin | MVVKAALLADKFPVL | Allergen | Non-Toxin | KCVPQADVEWKFYDAQ | Non-Allergen | Non-Toxin
 | KCVPQADVEW | Non-Allergen | Non-Toxin | | | | | |
Spike glycoprotein | AEVQIDRLI | Non-Allergen | Non-Toxin | VEAEVQIDRLITGRL | Non-Allergen | Non-Toxin | DRLITGRLQSLQTYVT | Non-Allergen | Non-Toxin
 | RLDKVEAEV | Allergen | Non-Toxin | LQTYVTQQLIRAAEI | Non-Allergen | Non-Toxin | LNDILSRLDKVEAEVQ | Allergen | Non-Toxin
ORF3a | FTSDYYQLY | Allergen | Non-Toxin | VLHSYFTSDYYQLYS | Non-Allergen | Non-Toxin | TSPISEHDYQIGGYTE | Allergen | Non-Toxin
 | SEHDYQIGGY | Non-Allergen | Non-Toxin | HSYFTSDYYQLYSTQ | Non-Allergen | Non-Toxin | | |
ORF7a | QECVRGTTVL | Non-Allergen | Non-Toxin | ILFLALITLATCELY | Non-Allergen | Non-Toxin | TCELYHYQECVRGTTV | Allergen | Toxin
 | ILFLALITL | Non-Allergen | Non-Toxin | | | | | |
Fig. 4

Modelling of MHC-I, MHC-II restricted T-cell and B-cell epitopes for 12 CnRs belonging to (a) NSP2 (b) NSP8 (c) NSP10 (f) RdRp (f) Exon (g) Spike glycoprotein (h) ORF3a and (i) ORF7a.


Discussion

Since its emergence in Wuhan, China, SARS-CoV-2 has spread very rapidly around the world, resulting in a global pandemic. Though the vaccination process has started, the number of COVID-19 patients is still quite large, and the successive waves of the pandemic remain a huge threat to the human population. In this regard, it is important to develop a bioinformatics pipeline for in-depth analysis of SARS-CoV-2 genomes every one or two months over the next four to five years, in order to track the evolution, genetic variability, virus strains and conserved regions, and to use such information for proper vaccine design. Moreover, the mutated variants found in India are a major concern for researchers. Thus, identification of virus strains is essential in today’s scenario. Vaccines are the only ray of hope in this dire situation, which makes the development of peptide-based synthetic vaccines, viz. epitopes, even more necessary. In this regard, we have analysed 4996 Indian SARS-CoV-2 genomes, which has resulted in the identification of five clades and subsequently 10 signature SNPs in each clade. Also, based on entropy, conserved regions are identified for the aligned sequences, and primers and probes are identified for SARS-CoV-2 detection. Furthermore, we have identified T-cell and B-cell epitopes for the development of vaccines. Changes in amino acid residues can often alter the translated protein, which may lead to functional instability of the protein. In this regard, sequence and structural homology-based predictions of the amino acid changes in the non-synonymous signature SNPs, along with their protein stability, for the 4996 sequences are reported in Table 2 using PROVEAN (Protein Variation Effect Analyser) [33], PolyPhen-2 (Polymorphism Phenotyping) [34] and I-Mutant 2.0 [35] to judge the characteristics of the identified clades.
PROVEAN uses a sequence-based prediction algorithm, while PolyPhen-2 bases its prediction on sequence, structural and phylogenetic information of a SNP. I-Mutant 2.0 uses a support vector machine (SVM) for the automatic prediction of protein stability changes for SNPs. PROVEAN and PolyPhen-2 are used to find the deleterious or damaging non-synonymous SNPs. The threshold value of PROVEAN is set to −2.5. If the PROVEAN score of a SNP is less than or equal to this threshold, the corresponding non-synonymous mutation is deleterious. For PolyPhen-2, the score ranges from 0 to 1; the closer the score is to 1, the more confidently the mutation is considered to be damaging. As reported in Table 2, by the consensus of PROVEAN and PolyPhen-2, 8 of the 28 unique amino acid changes are deleterious and damaging. Moreover, protein stability is important for the functional and structural activity of a protein. Any change in protein stability may cause degradation of proteins. The protein stabilities for the non-synonymous signature SNPs are determined using I-Mutant 2.0, which predicts changes in protein stability from free energy change values (DDG). A decrease in protein stability is indicated by a zero or negative DDG value. Table 2 shows that, out of the 8 unique changes, 5 show a decrease in the stability of the protein structures. As a consequence, A97V in RdRp in 19A, V354L in Nucleocapsid in 19B, Q57H in Nucleocapsid in 20A, R203M in Nucleocapsid in 20B, and T85I in NSP2 and Q57H in ORF3a in 20C are the unique amino acid changes responsible for defining each clade, as they are all deleterious and damaging and also decrease the stability of the protein structures. All of them are marked in bold in Table 2. Physico-chemical properties are considered to show the significance of the epitopes reported in this paper. For each property, the physico-chemical values lie between 0 and 1.
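The consensus rule described above for PROVEAN, PolyPhen-2 and I-Mutant 2.0 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the PROVEAN threshold (−2.5) and the DDG rule (zero or negative) come from the text, whereas the PolyPhen-2 cut-off of 0.5 is an assumption made here because the text only says "closer to 1". The example scores are hypothetical and are not taken from Table 2.

```python
PROVEAN_THRESHOLD = -2.5   # from the text: score <= -2.5 is deleterious
POLYPHEN_CUTOFF = 0.5      # assumption: the text only says "closer to 1"


def classify_change(provean_score, polyphen_score, ddg):
    """Combine the three predictions for one amino acid change."""
    deleterious = provean_score <= PROVEAN_THRESHOLD
    damaging = polyphen_score > POLYPHEN_CUTOFF
    destabilising = ddg <= 0  # zero or negative DDG means decreased stability
    return {
        "deleterious": deleterious,
        "damaging": damaging,
        "decreases_stability": destabilising,
        # Consensus of both effect predictors plus the stability decrease
        "consensus": deleterious and damaging and destabilising,
    }


# Hypothetical scores for illustration only
example_a = classify_change(provean_score=-3.1, polyphen_score=0.97, ddg=-0.8)
example_b = classify_change(provean_score=-1.2, polyphen_score=0.97, ddg=-0.8)
print(example_a["consensus"], example_b["consensus"])
```

A change is flagged only when all three criteria agree, which mirrors the way the clade-defining changes are described in the text.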
The physico-chemical properties of the MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes belonging to the 12 CnRs are reported in Supplementary Tables S4, S5 and S6 respectively. As reported in Table S4, the MHC-I restricted T-cell epitope SEVGPEHSL has a hydrophobicity value of −0.11, steric hindrance value of 0.52, hydropathicity of −0.64, amphipathicity of 0.44, hydrophilicity of 0.36, net hydrogen of 0.56, charge of −1.5, pI of 4.51 and molecular weight of 954.13. The physico-chemical properties of the other epitopes are reported in the corresponding tables as well. For further validation, the conformational 2D non-covalent structures of the MHC-I and MHC-II restricted T-cell epitopes are studied using LigPlot+. It is also very important to study their structural characteristics, such as binding conformation. Hence, to identify stable binding interactions, molecular docking of the MHC-I and MHC-II restricted T-cell epitopes is evaluated using AutoDock Vina. For this purpose, the 3D structures of the epitopes are first prepared with the build structure function of Chimera 1.14, and the crystal structures of the HLA alleles are retrieved in PDB format from the RCSB Protein Data Bank. To identify the binding energy at the binding groove of the HLA allele, the grid search space is set to (60,60,60) with the centre of the grid at (0,0,0) for the X, Y and Z coordinates and a spacing parameter of 0.964. The best pose is selected as the one with the highest binding affinity, i.e. the lowest docking score generated by AutoDock Vina. We have also used DOE-MBI services, namely PROCHECK, ERRAT and Verify3D, for the Ramachandran plot, structure quality and 3D structure verification respectively. The results of the docking analysis, along with the Z-score, respective PDB ID, total energy of the 3D complex, van der Waals energy and electric energy of each complex, are reported in Table 8 and Table 9 for the MHC-I and MHC-II restricted T-cell epitopes respectively.
The results for SEVGPEHSL and NEFYAYLRKHFSMMI, the most immunogenic, and HPNPKGFCDL and GCVPLNIIPLTTAAK, the most antigenic MHC-I and MHC-II restricted T-cell epitopes, are shown in Fig. 5, Fig. 6, Fig. 7 and Fig. 8, while the results for DTDFVNEFY and QEYADVFHLY, which are also the most immunogenic MHC-I restricted T-cell epitopes, are shown in supplementary Figures S11 and S13 respectively. In these figures, (a) shows the docked complex with the epitope (marked in green) interacting in the HLA pocket, where the docking scores generated by AutoDock Vina are −7.02, −7.786, −8.848 and −7.438 for MHC-I and −8.465 and −7.298 for MHC-II; (b) shows the 2D binding representation between the epitope and the respective allele; (c) shows the ERRAT score; (d) shows the Z-score, where negative scores of −8.92, −8.98, −8.95 and −8.98 for MHC-I and −9.50 and −8.91 for MHC-II represent the stability of the structures of the identified epitopes; (e) shows the Ramachandran plot evaluated using PROCHECK, where the most favourable regions for the residues are shown in red; (f) shows the energy residue plot generated using Verify 3D for Chain A of the docked complex; and (g) shows the energy residue plot generated using Verify 3D for Chain B of the docked complex. Similar structure-based evaluations are carried out for all the T-cell epitopes of the 12 conserved regions and are reported in supplementary Figures S4-S42.
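The docking set-up and the "lowest score wins" rule described above can be illustrated with a short Python sketch. It is hedged in several ways: the receptor and ligand file names are hypothetical; only the standard AutoDock Vina configuration keys `receptor`, `ligand`, `center_*` and `size_*` are written; the spacing value of 0.964 mentioned in the text is not expressed in this configuration; and whether the reported (60,60,60) box is given in grid points or in Angstrom is not stated in the source, so the values are passed through unchanged.

```python
def vina_config(receptor, ligand, center=(0.0, 0.0, 0.0), size=(60, 60, 60)):
    """Build the text of an AutoDock Vina configuration file."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    lines += [f"center_{axis} = {value}" for axis, value in zip("xyz", center)]
    lines += [f"size_{axis} = {value}" for axis, value in zip("xyz", size)]
    return "\n".join(lines)


def best_binder(scores):
    """Name of the entry with the lowest (most favourable) docking score."""
    return min(scores, key=scores.get)


# Hypothetical file names, for illustration only
config_text = vina_config("hla_allele.pdbqt", "epitope.pdbqt")

# Docking scores quoted in the text for four MHC-I restricted epitopes
mhc1_scores = {
    "SEVGPEHSL": -7.02,
    "DTDFVNEFY": -7.786,
    "QEYADVFHLY": -8.848,
    "HPNPKGFCDL": -7.438,
}
print(config_text)
print(best_binder(mhc1_scores))
```

Among the four scores quoted in the text, the most negative one identifies the most favourable docked complex under this rule.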
Table 8

Docking and Z-scores of most Immunogenic and Antigenic MHC-I restricted T-cell epitopes for 12 CnRs.

MHC-I restricted T-cell epitopesAllele PDB IDScore from AutoDock VinaTotal EnergyvdW EnergyElectric EnergyERRAT ScoreZ Score
SEVGPEHSL3LN4:A-7.0256.5974.242-84.05892.1127-8.92
NSEVGPEHSL3LN4:A-7.82662.780.135-71.23792.1127-8.92
NTCDGTTFTY3BO8:A-7.89679.4780.388-72.21182.3529-8.98
TTFTYASALW3VRI:A-9.932131.03-26.04-49.881.5642-9.27
DLKGKYVQI4QRU:A-8.00730.829-7.715-80.480.4469-9.48
HPNPKGFCDL4U1H:A-7.43851.815-3.509-61.08384.9582-8.97
SLLMPILTL3UTQ:A-8.166117.669-10.804-48.97683.3333-9.38
SGVPVVDSY2CIK:A-8.07479.882-6.491-77.61584.0336-9.28
KLFDRYFKY5E00:A-8.32338.0630.837-81.05285.1955-8.77
DTDFVNEFY3BO8:A-7.78684.77-1.521-75.16282.3529-8.98
YLRKHFSMM4QRU:A-8.02940.78-18.508-41.45980.4469-9.48
QEYADVFHLY1N2R:A-8.84888.793-9.037-85.6685.1955-8.95
QEYADVFHL3LN4:A-7.99648.8241.057-95.90692.1127-8.92
NLSDRVVFV3OX8:A-7.3212.558-17.624-83.82482.5843-9.3
LLADKFPVL3UTQ:A-7.84560.256-0.423-73.61283.3333-9.38
KCVPQADVEW3VRI:A-7.36244.6189.799-82.42681.5642-9.27
AEVQIDRLI1N2R:A-7.302-5.739-14.044-59.42385.1955-8.95
RLDKVEAEV3UTQ:A-7.406-35.156-10.383-59.38983.3333-9.38
FTSDYYQLY3BO8:A-8.00791.699-12.984-63.35183.3333-8.98
SEHDYQIGGY1N2R:A-9.45867.521-29.967-56.64285.1955-8.95
QECVRGTTVL3LN4:A-8.409-0.982-8.186-75.8292.1127-8.92
ILFLALITL3UTQ:A-8.656123.773-19.829-50.91383.3333-9.38
Table 9

Docking and Z-scores of most Immunogenic and Antigenic MHC-II restricted T-cell epitopes for 12 CnRs.

MHC-II restricted T-cell epitopesAllele PDB IDScore from AutoDock VinaTotal EnergyvdW EnergyElectric EnergyERRAT ScoreZ Score
TTCGYLPQNAVVKIY1FV1:B-8.18751.807-11.448-73.61683.3333-9.38
ATTCGYLPQNAVVKI1FV1:B-7.00253.4573.071-74.54292.1127-8.92
VPLNIIPLTTAAKLM6CPN:B-7.13476.07-0.246-70.52482.3529-8.98
GCVPLNIIPLTTAAK1X7Q:A-7.298117.6747.064-70.2283.7079-8.91
LKGKYVQIPTTCAND4MD4:B-7.16826.78618.782-118.48584.0336-9.28
DLKGKYVQIPTTCAN4MD4:B-7.59851.579-8.601-62.76584.0336-9.28
SYYSLLMPILTLTRA2G9H:B-8.18593.108-19.626-34.57484.0782-9.21
TEERLKLFDRYFKYW3WEX:A; 3WEX:B-8.07335.351-8.623-76.36883.7079-8.95
RLKLFDRYFKYWDQT3WEX:A; 3WEX:B-8.56877.593-17.304-51.47588.169-8.93
NEFYAYLRKHFSMMI1A6A:B-8.465100.048-14.017-61.44787.9552-9.5
EFYAYLRKHFSMMIL1A6A:B-10.03247.328-36.397-46.92288.4831-8.97
VFHLYLQYIRKLHDE1T5W:B-7.43133.396-7.497-60.17880.4469-9.48
HMLDMYSVMLTNDNT4MD4:B-8.01988.304-12.212-63.94383.7535-8.95
VRIKIVQMLSDTLKN1T5W:B-6.854-59.10537.684-153.88877.7465-9.09
PWNVVRIKIVQMLSD1T5W:B-7.87792.966-19.085-38.80883.3333-9.38
MVVKAALLADKFPVL3WEX:A; 3WEX:B-7.2897.9271.388-98.58477.7465-9.09
VEAEVQIDRLITGRL1A6A:B-7.8452.052-10.221-87.5783.7079-8.95
LQTYVTQQLIRAAEI1T5W:B-8.08024.104-8.501-96.55177.7465-9.09
VLHSYFTSDYYQLYS3WEX:A; 3WEX:B-7.45340.9045.179-116.22381.5642-9.27
HSYFTSDYYQLYSTQ3WEX:A; 3WEX:B-7.964107.759-16.583-52.0582.3529-8.98
ILFLALITLATCELY2G9H:B-8.45639.487-18.368-86.62985.9944-8.83
Fig. 5

Structural analysis for the most immunogenic MHC-I restricted T-cell epitope “SEVGPEHSL” in 12 CnRs (a) Docking structure of MHC-I restricted T-cell epitope (b) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (c) ERRAT Score (d) Z-Score plot (e) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frames and (f) Verify 3D scores in Chain A of the docked complex (g) Verify 3D scores in Chain B of the docked complex.

Fig. 6

Structural analysis for the most antigenic MHC-I restricted T-cell epitope “HPNPKGFCDL” in 12 CnRs (a) Docking structure of MHC-I restricted T-cell epitope (b) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (c) ERRAT Score (d) Z-Score plot (e) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frames and (f) Verify 3D scores in Chain A of the docked complex (g) Verify 3D scores in Chain B of the docked complex.

Fig. 7

Structural analysis for the most immunogenic MHC-II restricted T-cell epitope “NEFYAYLRKHFSMMI” in 12 CnRs (a) Docking structure of MHC-II restricted T-cell epitope (b) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (c) ERRAT Score (d) Z-Score plot (e) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frames and (f) Verify 3D scores in Chain A of the docked complex (g) Verify 3D scores in Chain B of the docked complex.

Fig. 8

Structural analysis for the most antigenic MHC-II restricted T-cell epitope “GCVPLNIIPLTTAAK” in 12 CnRs (a) Docking structure of MHC-II restricted T-cell epitope (b) 2D pose representation between the epitope and HLA allele showing the different non-covalent bonds (c) ERRAT Score (d) Z-Score plot (e) Ramachandran plot of the epitope allele structure showing lower energy sites of the residues in different frames and (f) Verify 3D scores in Chain A of the docked complex (g) Verify 3D scores in Chain B of the docked complex.

Note that in our previous works [13], [14], with a refinement criterion of 60 nt, 17 and 23 conserved regions were identified respectively, with 30, 24 and 21 epitopes in the former and 34, 37 and 29 epitopes in the latter as the best immunogenic and antigenic MHC-I restricted T-cell, MHC-II restricted T-cell and B-cell epitopes. Those experiments were conducted on SARS-CoV-2 sequences available until July 2020. As the virus is constantly evolving, a more recent analysis is needed to understand the evolution of the epitopes. Therefore, this work, which uses sequences available until January 2021, is highly relevant in the current scenario of constant virus mutation.

Conclusion

In the past two years, India has witnessed different surges of COVID-19 cases. Hence, it is important to provide a comprehensive bioinformatics pipeline to understand the virus evolution, identify the mutation points as SNPs and the conserved regions, and design potential vaccine candidates. In this regard, multiple sequence alignment of 4996 Indian SARS-CoV-2 genomes, taken as a case study, is first carried out using MAFFT, followed by phylogenetic analysis with Nextstrain to identify virus clades, resulting in 5 clades: 19A, 19B, 20A, 20B and 20C. Thereafter, mutation points are identified as SNPs in each clade, from which the top 10 signature SNPs are further identified based on their frequency in each clade. Thus, 40 unique signature SNPs are identified from the total of 50 signature SNPs, resulting in 23 non-synonymous signature SNPs which provide 28 amino acid changes in protein. These changes are visualised in their respective protein structures as well. The sequence and structural homology-based predictions of the non-synonymous signature SNPs, along with their protein structural stability, are evaluated to judge the characteristics of the identified clades. As a consequence, A97V in RdRp in 19A, V354L in Nucleocapsid in 19B, Q57H in Nucleocapsid in 20A, R203M in Nucleocapsid in 20B, and T85I in NSP2 and Q57H in ORF3a in 20C are the unique amino acid changes responsible for defining each clade, as they are all deleterious and damaging and also decrease the protein structural stability. Furthermore, based on the entropy of each genomic coordinate of the aligned sequences, 473 conserved regions are identified, which are then refined based on the criteria that their lengths are greater than 125 nt and their BLAST specificity score (query coverage) is equal to 100%.
This refinement results in 12 conserved regions belonging to the NSP2, NSP8, NSP10, RdRp, Exon, Spike glycoprotein, ORF3a and ORF7a proteins. Based on length, one conserved region belonging to the NSP10 gene is considered to be the potential target, for which the corresponding primers and probes are reported for SARS-CoV-2 detection. The 12 conserved regions are then used to identify T-cell and B-cell epitopes along with their immunogenic and antigenic scores. These scores are used to select the most immunogenic and antigenic T-cell and B-cell epitopes, resulting in 22 MHC-I and 21 MHC-II restricted T-cell epitopes with 10 unique HLA alleles each, and 17 B-cell epitopes. Finally, the relevance of these epitopes is validated by showing the binding conformation of the MHC-I and MHC-II restricted T-cell epitopes with respect to the HLA alleles. The physico-chemical properties of the epitopes are also reported, along with the structural properties assessed using Ramachandran plots, ERRAT scores and Z-scores. Hence, from genetic variability to synthetic vaccine design, a comprehensive bioinformatics pipeline is presented in this study to fight against SARS-CoV-2.
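The entropy-based identification of conserved regions summarised above can be illustrated with the following Python sketch. It is written under stated assumptions rather than taken from the original pipeline: Shannon entropy is computed per alignment column, a column is treated as conserved when its entropy does not exceed `max_entropy` (zero by default, which is an assumption), and a run of conserved columns is kept only when it is longer than `min_length`, matching the "greater than 125 nt" criterion. The BLAST query-coverage refinement is not covered, and the toy alignment is purely illustrative.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in Counter(column).values())


def conserved_regions(aligned, min_length, max_entropy=0.0):
    """1-based inclusive (start, end) runs of conserved columns longer than min_length."""
    n_cols = len(aligned[0])
    regions, start = [], None
    for pos in range(n_cols + 1):
        conserved = pos < n_cols and column_entropy([s[pos] for s in aligned]) <= max_entropy
        if conserved and start is None:
            start = pos                       # a conserved run begins
        elif not conserved and start is not None:
            if pos - start > min_length:      # keep only sufficiently long runs
                regions.append((start + 1, pos))
            start = None
    return regions


# Toy alignment for illustration; the paper uses 4996 aligned genomes
toy_alignment = ["ACGTACGTAA", "ACGTACCTAA", "ACGTACGTAA"]
print(conserved_regions(toy_alignment, min_length=3))
```

For the real data, the equivalent call would use `min_length=125` on the MAFFT alignment, with the entropy threshold chosen as in the original analysis.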

Ethics approval and consent to participate

The ethical approval or individual consent was not applicable.

Availability of data and materials

The aligned 4996 Indian SARS-CoV-2 genomes with the reference sequence and the final results of this work are available at http://www.nitttrkol.ac.in/indrajit/projects/COVID-Pipeline-5K/. Moreover, the SARS-CoV-2 genomes used in this work are publicly available in the GISAID database.

Consent for publication

Not applicable.

Funding

This work has been partially supported by CRG short term research grant on COVID-19 (CVD/2020/000991) from Science and Engineering Research Board (SERB), Department of Science and Technology, Govt. of India.

Author contributions

Nimisha Ghosh: Formal analysis; Methodology; Coding; Visualization; Writing - original draft & editing. Indrajit Saha: Conceptualization; Data curation; Supervision; Funding acquisition; Formal analysis; Investigation; Methodology; Project administration; Resources; Validation; Visualization; Writing - review & editing. Nikhil Sharma: Methodology; Visualization; Writing - review & editing. Suman Nandi: Conceptualization; Formal analysis; Software; Validation; Visualization; Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.