A novel ensemble fuzzy classification model in SARS-CoV-2 B-cell epitope identification for development of protein-based vaccine.

Zeynep Banu Ozger, Pınar Cihan.

Abstract

B-cell epitope prediction has received growing interest since the first prediction method was developed. Accurate identification of B-cell epitopes is one of the most important steps in epitope-based vaccine development, immunodiagnostic testing, antibody production, disease diagnosis, and treatment. However, experimental epitope mapping is time-consuming, costly, and labor-intensive. Successful in silico prediction is therefore very important, yet studies in this area remain limited. The aim of this study is to propose a new approach for predicting B-cell epitopes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We compared the SARS-CoV B-cell epitope prediction performance of several fuzzy learning classification models, namely genetic cooperative competitive learning (GCCL), fuzzy genetics-based machine learning (GBML), Chi's method (CHI), Ishibuchi's method with weight factor (W), and the structural learning algorithm on vague environment (SLAVE), with that of the proposed ensemble fuzzy classification model. The results showed that the proposed ensemble approach has the lowest error in SARS-CoV B-cell epitope prediction compared to the base fuzzy learners (average error rates: ensemble fuzzy = 8.33, GCCL = 30.42, GBML = 23.82, CHI = 29.17, W = 46.25, and SLAVE = 20.42). Because SARS-CoV and SARS-CoV-2 have high genome similarity, the most successful method for SARS-CoV B-cell epitope prediction was then used for SARS-CoV-2 B-cell epitope prediction. Finally, the B-cell epitope predictions obtained for SARS-CoV-2 with the ensemble fuzzy classification model were compared, according to VaxiJen 2.0 scores, with the epitope sequences predicted by the BepiPred server and by immunoinformatics studies in the literature for the same protein sequences.
We hope that the developed epitope prediction method will help design effective vaccines and drugs against future outbreaks of the coronavirus family, especially SARS-CoV-2 and its possible mutations.
© 2021 Elsevier B.V. All rights reserved.

Keywords:  B-cell; Epitope; Fuzzy learning; SARS-CoV; SARS-CoV-2; Spike protein

Year:  2021        PMID: 34931117      PMCID: PMC8673934          DOI: 10.1016/j.asoc.2021.108280

Source DB:  PubMed          Journal:  Appl Soft Comput        ISSN: 1568-4946            Impact factor:   6.725


Introduction

The immune system is a network of biological processes that protects an organism from disease. It is responsible for preventing infections and eliminating established ones. There are 2 types of immunity: innate and adaptive. Innate immunity is activated when an organism such as a bacterium or a virus enters the body; because it has no immunological memory, it cannot recognize the same pathogen when it is encountered again. Adaptive immunity comes into play where innate immunity is insufficient, such as in viral infection; because it has immunological memory, it can recognize previously encountered pathogens and therefore mounts an immune response more quickly [1], [2]. Antibodies in the blood and B and T lymphocytes generate adaptive immune responses [3]. Because B and T lymphocytes carry memory, they are considered an important component of the adaptive immune system. B lymphocytes, in particular, provide a protective function by producing antibodies. Antibodies are proteins that recognize and bind biological substances called antigens. Each antibody is specific to a matching antigen and recognizes only that antigen; in other words, distinct antibodies are produced against each antigen [4]. B and T cells carry specific receptors on their surface, and these receptors enable them to recognize the antigen [5]. The part of an antigen that binds to B/T-cell receptors or antibodies is called an epitope [4]. B-cell epitopes are the parts of the antigen that interact with B-cell receptors (BCRs). As seen in Fig. 1, B-cells fight bacteria and viruses by producing antibodies, which are Y-shaped proteins. B-cells recognize antigens using membrane-bound immunoglobulins (Igs). The part of the antigen that binds to the immunoglobulin or antibody is called the B-cell epitope. When an antigen binds to the membrane-bound immunoglobulin, the B-cell is activated and proliferates. Some of the proliferating cells become plasma cells, and some become memory cells. This immunological memory provides a fast and effective response to previously encountered pathogens [6].
Fig. 1

Continuous and discontinuous B-cell epitopes [7].

B-cell epitopes are of 2 types, linear (continuous) and conformational (discontinuous) (Fig. 1). Although linear epitopes are relatively few, it is important to identify them, as they consist of peptides that can be readily used in antibody production; continuous epitope prediction is therefore the more convenient route for antibody production [8].

Coronaviruses are single-stranded RNA viruses that are very common in animals. They contain 4 basic structural proteins. Of these, the nucleocapsid (N), membrane (M), and envelope (E) proteins are responsible for the assembly of the virion, while the spike (S) protein on the surface allows the virus to attach to the target cell [9]. Coronaviruses are divided into 4 main types: alpha, beta, gamma, and delta. Some alpha and beta species can cause respiratory tract infections in humans [10]. Severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV) belong to the beta coronavirus family and have caused serious epidemics in the recent past. Due to its high similarity to SARS-CoV, the coronavirus responsible for COVID-19 was named SARS-CoV-2 by the Coronaviridae Study Group (CSG) of the International Committee on Taxonomy of Viruses [11]. The structure of SARS-CoV-2 is depicted in Fig. 2. The spike protein is a critical determinant for SARS-CoV-2, as it interacts with the host cell receptor.
Fig. 2

Structure of SARS-CoV-2 [12].

Epitope data for SARS-CoV-2 are still limited, but since the gene and protein sequences of the virus are known, epitope information can be estimated by in silico analysis, taking into account past beta coronaviruses [13]. For this purpose, the sequence similarity of SARS-CoV-2 to other coronavirus species was examined. As the phylogenetic tree in Fig. 3 shows, SARS-CoV-2 is more similar to SARS-CoV than to MERS-CoV [14].
Fig. 3

Phylogenetic tree of beta coronaviruses [14].

Epidemics can be brought under control by community immunity or vaccination. Gaining community immunity requires a long process [15]. Since the epidemic causes loss of life and economic problems, it is important to develop an effective vaccine quickly. To develop a vaccine, it is necessary to identify protective immunogens against pathogens [16]. There is limited information as to which parts of the SARS-CoV-2 sequence are recognized by the human immune system. This information is important both for vaccines and for monitoring mutations [17]. It is known that early immune responses against SARS-CoV-2 are mediated by IgM and IgA, while IgG responses appear 7–10 days after infection. It has been reported that developing neutralizing antibodies against the spike protein is important for an effective vaccine [18]. It has been found that directing antibodies to the receptor-binding domain and binding to spike trimers are important for long-term protective immunity against COVID-19 [19]. B-cell memory can reactivate antigen-specific responses upon re-exposure to infection [20]. Identifying conserved epitope regions would be useful in generating robust immunity not only against the current SARS-CoV-2 outbreak but also against ongoing virus evolution. Due to their strong immune responses, the S, N, and M proteins were preferred as antigens in vaccine studies for SARS-CoV and MERS [21], [22], [23].

Traditionally, epitopes are identified by experimental techniques, which is a costly and time-consuming process [17]: every possible subsequence of the protein sequence must be tested in vitro to confirm whether it is antigenic. With in silico approaches, subsequences that are less likely to be epitopes are eliminated using bioinformatics tools and historical data. The cost is thus lowered by reducing the number of epitope candidates that need to be examined in vitro [24]. In this way, identifying protein regions that are likely to be epitopes contributes to candidate vaccine and drug studies by narrowing the search space for epitopes.

The aims of this study are to compare the prediction performance of fuzzy learning models for epitope regions and to propose a new ensemble method for determining epitope regions by in silico analysis. The study makes the following contributions: (i) SARS-CoV and SARS-CoV-2 B-cell epitope regions are determined by in silico analyses; (ii) the predictive success of fuzzy learning models in identifying SARS-CoV B-cell epitope regions is compared; (iii) a novel ensemble approach is proposed to predict SARS-CoV-2 B-cell epitope regions; (iv) the identified epitope regions contribute to the development of new protein-based vaccines against SARS-CoV-2 and future epidemics; (v) epitope candidates are proposed to assist biologists in developing a rapid and successful vaccine; (vi) the proposed ensemble method is shown to give more successful results than other studies in the literature; and (vii) the proposed ensemble approach for identifying SARS-CoV and SARS-CoV-2 B-cell epitope regions is statistically significant and robust.

Related works

Since the beginning of the SARS-CoV-2 epidemic, a great deal of research has been carried out, and it is still ongoing. Most studies concern case/death estimation [25], [26], [27], [28], detection of COVID-19 from medical images [29], [30], [31], or estimation of the number of vaccinated people [15]. There is a gap in the literature on determining epitope regions by in silico methods, which would benefit both vaccine and drug development against SARS-CoV-2 and future epidemics. The authors of [32] stated that in silico epitope prediction follows 2 different routes: prediction methods based on SARS-CoV immunological data, justified by the genetic similarity of the two viruses, and peptide binding prediction methods. They reviewed studies using both routes together with the predicted epitopes, compared the epitopes obtained in vitro with those estimated in silico, and found that the methods using SARS-CoV immunological data generally agreed with the experimental results. In the present study, B-cell epitope prediction for SARS-CoV-2 is likewise based on SARS-CoV immunological data. For this reason, the literature review focuses on prediction studies based on SARS-CoV immunological data, of which there are only a limited number. In some of these studies, the sequence alignment results of SARS-CoV and SARS-CoV-2 were evaluated with bioinformatics tools to identify candidate epitopes. Other researchers have made predictions using the immunological data of SARS-CoV epitopes. Nucleocapsid and spike proteins are the dominant structural proteins of SARS-CoV-2, as in other beta coronaviruses. 
In studies on the vaccine, it has been shown that spike protein is effective for developing a peptide vaccine and is a good candidate for generating a B-cell-dependent immune response [33]. In studies in which the nucleocapsid protein was experimentally tested for SARS-CoV, it was observed that it was the dominant protein expressed in the virion in the early stage of infection [34]. Studies have shown that the nucleocapsid protein is a strong T-cell-dependent immunogen [35]. Therefore, peptide-based studies on the immune response have focused on these 2 proteins. Within the scope of the study in [33], 34 linear B-cell epitopes, 29 MHC I, and 8 MHC II T-cell epitopes were shown as candidates for the vaccine. Authors in [17] utilized the bioinformatics tools provided in the Immune Epitope Database (IEDB) and Virus Pathogen Resource (ViPR) to identify regions corresponding to SARS-CoV-2 sequences and to predict possible epitopes. IEDB is a database containing epitope information compiled from scientific literature for infectious disease, allergy, and autoimmunity. It also includes online bioinformatics tools to analyze epitope data and predict potential epitopes [36]. ViPR, on the other hand, is a database containing genome, gene, and protein sequence information about human pathogenic viruses [37]. In the related study, considering the conserved regions of SARS-CoV-2, B and T-cell epitope estimation for SARS-CoV-2 was realized based on sequence features. BepiPred 2.0 tool [4] was used for linear B-cell epitope prediction and Discotope 2.0 tool [38] for conformational B-cell epitope prediction. 29 epitopes for the spike protein, 4 for the nucleocapsid protein, and 3 for the membrane protein were identified as candidates. In another study based on sequence features, Chen et al. [39] aimed to predict linear and conformational B and T-cell epitopes in the spike and nucleocapsid proteins of SARS-CoV-2. 
They identified the conserved regions of the virus genome by aligning the SARS-CoV-2 protein sequences obtained from the NCBI database with the Clustal Omega bioinformatics tool. Linear B-cell epitope prediction was performed with the BepiPred and ABCPred [40] tools, and the epitope sequences with the highest antigenicity were listed. The authors measured antigenicity values with the VaxiJen 2.0 [41] server. Conformational epitope prediction was performed with Discotope 2.0. T-cell epitopes within the nucleocapsid protein that bind to the HLA-1 or HLA-2 molecule were predicted with the free online tool provided by IEDB. In total, 63 B-cell epitopes were proposed for vaccine studies. The authors of [42] performed B and T-cell epitope identification on the spike protein. They used NetCTL 1.2 for T-cell epitopes, ElliPro and RaptorX for conformational B-cell epitopes, and the BepiPred and ABCPred servers for linear B-cell epitopes. Their analysis yielded 5 T-cell epitopes, 4 linear B-cell epitopes, and 5 conformational B-cell epitopes. In [43], 115 T-cell epitopes and 298 B-cell epitopes that had been experimentally validated for SARS-CoV were obtained from the NIAID and ViPR [37] databases. These epitopes were aligned with the SARS-CoV-2 protein sequence, and conserved, unmutated regions were identified. Accordingly, 27 T-cell epitopes and 42 linear B-cell epitopes for the nucleocapsid and spike proteins were put forward as candidates for SARS-CoV-2. Sarkar et al. [44] identified possible B and T-cell epitopes using IEDB for the spike, nucleocapsid, ORF3a, and membrane proteins. Among these epitopes, those with high antigenicity, non-allergenicity, and non-toxicity were identified as candidate epitopes for SARS-CoV-2. The authors suggested 5 epitopes for the spike protein and 6 epitopes for the nucleocapsid protein for vaccine studies. 
The authors of [45] predicted B and T-cell epitopes in the spike, nucleocapsid, and membrane proteins of SARS-CoV-2 by an immunoinformatic method. Since B-cell epitopes must be able to bind to antigen receptors on the B-cell surface, they eliminated intracellular epitopes from the epitopes found with the BepiPred and BcePred servers. By measuring the antigenicity, allergenicity, and toxicity values of the remaining epitopes, they identified 10 linear B-cell epitopes with antigenicity greater than 0.9 as candidates. In another immunoinformatics study [46], B-cell epitopes for the spike protein were predicted with BepiPred 2.0. Those with a score above the threshold of 0.5 were also used for T-cell epitope prediction. Among the peptides found, the allergenic and toxic ones were eliminated, and 17 B-cell epitopes were presented as candidates. Rehman et al. [47] focused on predicting immune-response-inducing B and T-cell epitopes for multi-epitope vaccine design. Epitope prediction was performed with the spike, Mpro, Nsp 12, and Nsp 13 proteins of SARS-CoV-2. As a result of the study, 46 antigenic B-cell peptides were predicted for the spike protein. In [48], the authors proposed a method to classify T-cell responses by analyzing TCR beta information from subjects infected and uninfected with SARS-CoV-2. The proposed method aimed to detect protective immunity acquired through natural infection or vaccination. Principal Component Analysis (PCA) and hierarchical clustering were applied to the sequence data split into k-mers. Since the dataset contains few samples, it was divided with a hold-one-out scheme. An accuracy of 96% was obtained on the training data and 92.9% on the test data. The procedure was repeated for k-mers with a length of 3–9 amino acids, and the k-mer length with the highest success was found to be 5. 
Because the training dataset contains very few samples, the model overfits the training data, which reduces the generalizability of the proposed method. Lee and Koohy [24] extracted T-cell peptides identified for SARS-CoV and peptides with high immunogenicity from IEDB. By aligning these peptides with those of SARS-CoV-2, they identified peptides with high sequence similarity as candidate peptides. The MHC binding of the candidate peptides was measured with netMHCPan [49] and their immunogenicity with iPred [50], and the high-scoring peptides were listed for vaccine studies. The authors of [51] performed linear B-cell epitope prediction for SARS-CoV using an immunological epitope dataset [52] created from IEDB and UniProt. They classified the data with a Bayesian Neural Network, which is also used for uncertainty modeling in deep learning, on the grounds that measuring uncertainty also provides a measure of the reliability of the model. They achieved 85% accuracy on the SARS-CoV data. Aleatoric and epistemic uncertainty measures were used to quantify the uncertainty in epitope estimation. That study addressed only SARS-CoV epitope prediction; no prediction was made for SARS-CoV-2. Noumi et al. [53] applied a Long Short Term Memory (LSTM) network with an attention mechanism for epitope prediction on the IEDB dataset [52]. The results were compared with the epitope sequences predicted by BepiPred 2.0 for the same protein sequences. The epitope peptide length was limited to 8–14 amino acids. The highest accuracy, 0.79, was obtained for a peptide length of 12. In another study [54] on the IEDB epitope dataset [52], the authors predicted epitopes for SARS-CoV from immunological data with various machine learning methods. They used the dataset containing B-cell epitopes to develop the model and tested it on the SARS-CoV dataset. 
The most successful result, an accuracy of 87%, was obtained with the ensemble learning model. The coronavirus pandemic has proven that the world is not prepared for deadly viral outbreaks. Traditionally, it takes 15 or more years to develop a vaccine [55]. Thanks to in silico and computational methods, the number of vaccine candidate epitopes can be reduced, which speeds up the work of biologists in emergencies and epidemics. However, there is a gap in the literature on successful SARS-CoV-2 epitope prediction. The motivation of this study is to support biologists in vaccine development by rapidly identifying a small number of vaccine candidate epitopes using in silico and bioinformatics tools.

Material and methods

In this study, we used the publicly available Kaggle dataset of SARS-CoV epitopes and SARS-CoV-2 peptides for the prediction of epitope regions. A novel ensemble fuzzy classification model was proposed for the successful prediction of epitope regions. The R programming language was used for the development of the fuzzy learning models and for statistical analysis. Fuzzy rule-based classification systems (FRBCSs) belong to the soft computing family and are considered an effective approach to modeling complex problems. FRBCSs are fuzzy rule-based systems specialized for classification problems, and they provide an interpretable model through the use of linguistic labels in their rules. The general framework of the proposed model is given in Fig. 4. The labeled SARS-CoV dataset is used to train the fuzzy methods. To obtain statistical validity, the dataset was divided into training and test sets 6 times using random sampling with replacement. The fuzzy rule sets obtained during the training phase were applied to the test sets, and their performances were measured. Five fuzzy methods (GBML, GCCL, CHI, SLAVE, W) were applied to each training set separately, and the final decision for each sample of the corresponding test set was made by majority voting over the individual decisions of these 5 methods. In this way, one ensemble model is obtained for each training set. Predictions for the unlabeled SARS-CoV-2 data were made with all of these models, and the class of each peptide was obtained by combining the decisions of the ensemble models.
Fig. 4

General framework of the ensemble fuzzy classification model.

The datasets used in this study are described in Section 3.1, the FRBCS methods are briefly explained in Section 3.2, the proposed model is given in Section 3.3, and the evaluation metrics are described in Section 3.4.
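The repeated train/test construction described in this section can be sketched as follows. This is only an illustration in Python (the study itself used R), and it rests on an assumption: the text says the splits were drawn by random sampling with replacement but does not give the exact scheme, so the sketch draws the training indices with replacement and uses the samples that were never drawn as the test set. The function name `bootstrap_splits` and the seed are hypothetical.

```python
import random


def bootstrap_splits(n_samples, n_repeats=6, seed=0):
    """Build (train_idx, test_idx) pairs by sampling with replacement.

    Assumption: the test set is the out-of-bag remainder of each draw;
    the paper does not spell out how its test sets were formed.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        # Draw n_samples indices with replacement for training.
        train = [rng.randrange(n_samples) for _ in range(n_samples)]
        in_bag = set(train)
        # Samples that were never drawn form the test set.
        test = [i for i in range(n_samples) if i not in in_bag]
        splits.append((train, test))
    return splits


# 520 is the size of the labeled SARS-CoV dataset reported in the text.
splits = bootstrap_splits(520)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each of the six pairs would then be used to train the five fuzzy learners and to score their majority vote on the held-out indices.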

Dataset description

In this study, a dataset [52] containing B-cell epitopes obtained from IEDB and UniProt was used to predict SARS-CoV-2 epitopes. It comprises 3 datasets: B-cell epitopes, SARS-CoV epitopes, and SARS-CoV-2 peptides. Of these, the B-cell and SARS-CoV epitope datasets are labeled, whereas the SARS-CoV-2 dataset contains peptides of various lengths that were extracted from a protein sequence and have no label information. Due to the high genome similarity, the SARS-CoV epitope dataset was used to develop the fuzzy model. The model with the highest test set accuracy for SARS-CoV was then applied to the SARS-CoV-2 dataset to predict epitopes. The datasets contain 13 features. The SARS-CoV and B-cell datasets also have target values indicating whether a peptide is capable of inducing antibodies. The proteins in the datasets are immunoglobulin antibody proteins, as they are the most common type of antibody found in the bloodstream. The dataset includes protein and peptide sequences, protein IDs, the starting and ending positions of the peptides in the protein sequence, and protein- and peptide-based features. B-cell epitope prediction is based on the antigenicity, hydrophobicity, surface accessibility, beta turn, and flexibility properties of epitopes. The features in the dataset and their descriptions are given in Table 1.
Table 1

Peptide- and protein-based features in the datasets.

Feature | Description
Chou–Fasman | Peptide feat. Relative frequency analysis on the basis of amino acids for tertiary structural elements. Given here for B-Turn.
Emini | Peptide feat. Relative surface accessibility, a measure of residue solvent exposure.
Kolaskar–Tongaonkar | Peptide feat. Antigenicity, antigenic propensity of residues.
Parker | Peptide feat. A measure of hydrophobicity of peptide.
Isoelectric-point | Protein feat. pH value of the amino acid in an electric field.
Aromaticity | Protein feat. A factor for protein fragment solubility.
Hydrophobicity | Protein feat. A measure of the degree of affinity between water and the side chain of an amino acid.
Stability | Protein feature.
There are a total of 520 samples in the SARS-CoV dataset. A total of 140 of them are in the positive class, and the remaining 380 samples are in the negative class. A positive label means that the corresponding peptide is an epitope. The longest peptide is 393 amino acids long, and the shortest is 5 amino acids long. The SARS-CoV-2 dataset includes 20312 samples, and its peptides are 5–20 amino acids long.

Fuzzy learning classification models

The genetic cooperative competitive learning (GCCL) [56] algorithm uses a genetic cooperative-competitive scheme to handle classification problems. In this technique, each chromosome describes one linguistic IF–THEN rule, using integers to represent the antecedent part. A heuristic is applied to automatically generate the class in the consequent part of each fuzzy rule. Fitness is evaluated separately for each rule; thus, performance is not based on the whole rule set.

The fuzzy genetics-based machine learning (GBML) model [57] is based on Ishibuchi’s hybridization of the GCCL and Pittsburgh approaches. The selection, crossover, and mutation operators of the genetic algorithm are applied as in the Pittsburgh approach, where each rule set is treated as an individual. GCCL steps are then applied to each of the generated rule sets with a probability specified as a parameter of the algorithm, so that good fuzzy rules are found efficiently.

Chi’s (CHI) method [58] extends the Wang and Mendel method [59] to classification problems, and the two algorithms are otherwise similar. Chi’s method generates fuzzy IF–THEN rules as in Wang and Mendel’s technique and then replaces their consequent parts with class labels. The degree of each rule is determined by its antecedent part, and redundant rules can be eliminated according to their degrees. In this way, fuzzy IF–THEN rules based on the FRBCS model are obtained.

Ishibuchi’s method with a weight factor (W) uses the second type of FRBCS, which has weights (certainty grades) in the consequent parts of the rules [60]. The antecedent parts are determined from the training data by a grid-type fuzzy partition. The consequent class is defined as the dominant class in the fuzzy subspace corresponding to the antecedent part of each fuzzy IF–THEN rule. The class of a new instance is the consequent class of the rule with the maximum product of compatibility grade and certainty grade. The compatibility grade is determined by aggregating the membership degrees of the antecedent parts, while the certainty grade is calculated from the ratio among the consequent classes.

The structural learning algorithm in a vague environment (SLAVE) is based on an approach in which only one fuzzy rule is obtained each time the genetic algorithm is run. To remove irrelevant variables from a rule, SLAVE uses a two-part structure: the first part represents the relevance of the variables, and the second part defines the values of the parameters. The method uses binary codes to represent the population and applies the basic genetic operators, i.e., selection, crossover, and mutation, to it. The best rule is then identified as the rule with the highest degree of consistency and completeness [61].
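To make the rule-based decision step more concrete, the sketch below classifies a sample with a small set of weighted fuzzy IF–THEN rules using a single-winner scheme: the winning rule is the one with the largest product of compatibility and rule weight. Everything in it is illustrative and hypothetical: the three linguistic terms, the two feature names, the rules, and their weights are made up, the features are assumed to be scaled to [0, 1], and the min t-norm is assumed for aggregating antecedent memberships. It is not the rule base learned in the study.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic terms on features scaled to [0, 1].
TERMS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

# Hypothetical rules: (antecedent, consequent class, rule weight).
RULES = [
    ({"antigenicity": "high", "accessibility": "high"}, "epitope", 0.9),
    ({"antigenicity": "low", "accessibility": "medium"}, "non-epitope", 0.8),
    ({"antigenicity": "medium", "accessibility": "low"}, "non-epitope", 0.6),
]


def classify(sample, rules=RULES):
    """Return (label, score) of the rule with the largest compatibility * weight."""
    best_label, best_score = None, 0.0
    for antecedent, label, weight in rules:
        # Compatibility: min t-norm over the antecedent membership degrees.
        compat = min(
            triangular(sample[feature], *TERMS[term])
            for feature, term in antecedent.items()
        )
        score = compat * weight
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


print(classify({"antigenicity": 0.9, "accessibility": 0.8}))
print(classify({"antigenicity": 0.1, "accessibility": 0.5}))
```

If no rule fires (all scores are zero), the sketch returns `None` as the label; a real system would need a default class for that case.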

Proposed ensemble fuzzy classification approach

An ensemble fuzzy classifier technique is proposed to develop models and make predictions on the SARS-CoV dataset. Five different fuzzy methods were applied separately to the training data created by random sample selection. The proposed ensemble model combines the decisions of the GCCL, GBML, CHI, W, and SLAVE classifiers using a majority voting scheme. As shown in Fig. 5, the model developed with each method was used to make predictions on the test dataset consisting of random samples. The class of each sample in the test set was then decided by majority voting over the decisions of the five models.
Fig. 5

The proposed ensemble fuzzy classification model.
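The majority-voting step of the ensemble can be sketched in a few lines. The example is in Python rather than R, and the per-model predictions are made-up values for four test samples (1 = epitope, 0 = non-epitope); only the five model names come from the text. With five voters and two classes a tie cannot occur, so no tie-breaking rule is needed here.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most base classifiers."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(predictions_by_model):
    """Combine per-model prediction lists into one list, sample by sample."""
    per_sample_votes = zip(*predictions_by_model.values())
    return [majority_vote(votes) for votes in per_sample_votes]


# Hypothetical predictions of the five fuzzy learners on four test samples.
predictions = {
    "GCCL": [1, 0, 0, 1],
    "GBML": [1, 0, 1, 0],
    "CHI": [0, 0, 1, 1],
    "W": [1, 1, 0, 0],
    "SLAVE": [1, 0, 1, 0],
}

print(ensemble_predict(predictions))  # [1, 0, 1, 0]
```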

The training and test set creation process was repeated 6 times by random sampling with replacement. The performance of the developed system was measured by applying the proposed ensemble fuzzy model to each training-test pair and averaging the results. Based on the high genome similarity of SARS-CoV and SARS-CoV-2, the fuzzy models were built with SARS-CoV data, and epitope prediction was made by giving unlabeled SARS-CoV-2 data to these models as test data. Since the fuzzy methods used are heuristic, 6 training-test sets were created by random sampling from all SARS-CoV data, and an ensemble model was obtained for each training set by training five different fuzzy methods on it (modelChi, ModelGBML, modelGCCL, modelSlave, modelW in Fig. 5). SARS-CoV prediction performance was measured by majority voting for each training-test pair. Because the SARS-CoV data were divided into 6 training-test sets, SARS-CoV-2 epitopes were predicted by applying six ensemble models, each consisting of five SARS-CoV-trained fuzzy models, to the SARS-CoV-2 data. Each yellow box in Fig. 6 contains the training and model building processes shown in Fig. 5. The decisions of the six ensemble models are combined at different levels of strictness. The final epitope decision-making strategies were named 4V (at least 4 votes), 5V (at least 5 votes), and 6V (at least 6 votes). The epitope prediction process for SARS-CoV-2 is shown in Fig. 6.
Fig. 6

Prediction process for SARS-CoV-2.
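The 4V, 5V, and 6V strategies amount to thresholding the number of ensemble models that vote "epitope" for a peptide. The sketch below shows this in Python with made-up votes for four peptides from six ensemble models; the function names are hypothetical, and it assumes that a vote means an epitope (1) prediction.

```python
def vote_counts(ensemble_predictions):
    """Count, per peptide, how many ensemble models predicted epitope (1)."""
    return [sum(votes) for votes in zip(*ensemble_predictions)]


def select_epitopes(ensemble_predictions, min_votes):
    """Return indices of peptides with at least `min_votes` epitope votes."""
    counts = vote_counts(ensemble_predictions)
    return [i for i, n in enumerate(counts) if n >= min_votes]


# Hypothetical outputs of the six ensemble models for four peptides.
ensemble_predictions = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]

for name, threshold in (("4V", 4), ("5V", 5), ("6V", 6)):
    print(name, select_epitopes(ensemble_predictions, threshold))
```

Raising the threshold trades candidate count for agreement: here 4V keeps three peptides, 5V keeps two, and 6V keeps only the peptide on which all six models agree.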

The details of the parameter settings for each model are given in Table 2. All specified parameters were determined experimentally. In the proposed method, the labeled SARS-CoV dataset was used to develop a model with fuzzy methods. Since the problem under consideration is a classification problem, the degree of each rule is determined in all methods by how well the rule represents the data during training; that is, the better a rule represents the training data, the higher its degree. Membership functions are defined with the ‘type.mf’ parameter specified in Table 2. The main difference between the methods is the way in which learning is performed and the fuzzy rules are created.
Table 2

The parameters of fuzzy models.

Parameter | Description | Value
popu.size | Population size | GCCL:30, GBML:10
num.class | Number of classes | For all methods:2
num.labels | Number of linguistic terms | W:11, CHI:5, GCCL:9, GBML:7, SLAVE:7
persen_cross | Probability of crossover | GCCL:0.8, GBML:0.9, SLAVE:0.8
persen_mutant | Probability of mutation | GCCL:0.4, GBML:0.2, SLAVE:0.4
max.gen | Maximum number of generations | GCCL:150, GBML:10, SLAVE:40
type.mf | The type of the shape of the membership function | W:Gaussian, CHI:Triangle
type.tnorm | The type of the tnorm | W:min, CHI:min
type.snorm | The type of the snorm | W:sum, CHI:max
type.implication.func | Type of implication functions | W:Dienes–Rescher, CHI:Zadeh
max.num.rule | Maximum number of rules | GBML:10
p.dcare | A probability of “don’t care” attributes occurred | GBML:0.5
p.gccl | A probability of GCCL process occurred | GBML:0.4
max.iter | Maximum number of iterations | SLAVE:30
k.lower | A lower bound of the noise threshold | SLAVE:0
k.upper | A value between 0 and 1 representing the level of generalization | SLAVE:0.8

Evaluation metrics

The epitope prediction performances of the models on the SARS-CoV dataset were compared according to accuracy rate, error rate, sensitivity/recall rate (RR), specificity rate (SR), positive predictive value (PPV), and negative predictive value (NPV). These metrics are calculated from the confusion matrix (contingency table), which summarizes the performance of a classification model. The accuracy rate is the ratio of correctly classified peptides to the total number of peptides. The error rate is the ratio of incorrectly classified peptides to the total number of peptides. RR measures how well a test identifies true positives: it is the ratio of the true epitopes to all actual epitopes. SR is the ratio of the true non-epitopes to all actual non-epitopes. PPV is the ratio of the true epitopes to all predicted epitopes. NPV is the ratio of the true non-epitopes to all predicted non-epitopes. These metrics are expressed mathematically as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Error} = \frac{FP + FN}{TP + TN + FP + FN}$$

$$RR = \frac{TP}{TP + FN}, \qquad SR = \frac{TN}{TN + FP}$$

$$PPV = \frac{TP}{TP + FP}, \qquad NPV = \frac{TN}{TN + FN}$$

In the equations, TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.

Furthermore, the eventual B-cell epitope prediction results obtained for SARS-CoV-2 were compared with the epitope sequences predicted by the BepiPred server for the same protein sequences. BepiPred is a web server that predicts B-cell epitopes from antigen sequences (http://www.cbs.dtu.dk/services/BepiPred/). BepiPred makes predictions with a model trained using Random Forest on a dataset of 649 antigen–antibody crystal structures. The VaxiJen server (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html) was used to compare the SARS-CoV-2 epitopes predicted by the BepiPred server and presented in the literature with the epitopes found by the proposed method. This antigenicity measurement tool is used to analyze B and T-cell epitopes by evaluating the physical and chemical properties of amino acids and their abundance in known B and T-cell epitopes. The higher the antigenicity score of an epitope, the more likely it is to be usable as an antigen, i.e., it has greater immunogenicity.
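The metric definitions above translate directly into code. The following is a minimal sketch; the function name and the confusion-matrix counts are hypothetical and are not taken from Fig. 9.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the six evaluation metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present
    and both classes are predicted at least once).
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "RR": tp / (tp + fn),   # sensitivity / recall
        "SR": tn / (tn + fp),   # specificity
        "PPV": tp / (tp + fp),  # positive predictive value
        "NPV": tn / (tn + fn),  # negative predictive value
    }


# Hypothetical counts for a balanced test set of 40 peptides.
metrics = classification_metrics(tp=18, tn=19, fp=1, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and error always sum to 1, which is why the paper can report either one interchangeably.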

Experimental results

Dataset preprocessing

In the SARS-CoV and SARS-CoV-2 datasets, different peptides were identified for a single IgG protein with lengths of 1255 and 1281 amino acids, respectively. Because the parent protein ID and the protein-based features were therefore the same for all samples, these features were excluded from both datasets. Additionally, the features that give the start and end positions of the peptide and the peptide sequence were removed, and a new feature containing the peptide length was added. As a result, the datasets were arranged to contain a total of 5 features. There is also a label feature for SARS-CoV. The correlation matrix of the independent variables of the SARS-CoV dataset, density plots, and 2D density charts are given in Fig. 7.
Fig. 7

Correlation, density and 2D density plot of independent variables.

In Fig. 7, the lower triangle shows the 2D density of each pair of variables. The Pearson correlation is given in the upper triangle, and the variable distributions are illustrated on the diagonal. Linear dependence between two variables was measured with the Pearson correlation. In the upper triangle, both the correlation coefficient and the correlation significance level are given (***P < 0.001, **P < 0.01, *P < 0.05). When the plots are examined, it is seen that the variables are normally distributed and that the highest and most significant correlation is between the Parker and Chou–Fasman features (R = 0.67, P < 0.001).

SARS-CoV prediction

The SARS-CoV dataset is unbalanced in terms of label distribution. Of the 520 samples in the dataset, 140 are in the positive (epitope) class, while the remaining 380 are in the negative (non-epitope) class. The training and test sets were divided according to the class information. Out of the 140 samples in the positive class, 120 were randomly selected for training so that the model could learn the data, and the remaining 20 were used for testing. Since there were few samples in the positive class, it was observed that the model could not learn the positive class when the number of positive samples in the test set was increased. Likewise, 20 samples were randomly selected from the negative class so that the test set contained equal numbers of samples from both classes. The number of negative-class samples to include in the training set so that the model learns both classes correctly was decided experimentally. The training set contained 120 samples from the positive class. The per-class and total estimation errors of the proposed ensemble fuzzy model for 360, 300, 240, 180, and 120 negative-class training samples are given in Fig. 8.
Fig. 8

Train size tuning for negative class samples.
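The class-wise split described in this section (120 training and 20 test samples per class, repeated 6 times) can be sketched as follows. This is an illustration only: the function name, the seeds, and the placeholder sample identifiers are hypothetical, and the sketch assumes that the negative training and test samples are disjoint, which the text implies but does not state explicitly.

```python
import random


def balanced_split(positives, negatives, n_train=120, n_test=20, seed=None):
    """Draw a class-balanced training/test pair by random sampling."""
    rng = random.Random(seed)
    # Sampling n_train + n_test items at once keeps train and test disjoint.
    pos = rng.sample(positives, n_train + n_test)
    neg = rng.sample(negatives, n_train + n_test)
    train = pos[:n_train] + neg[:n_train]
    test = pos[n_train:] + neg[n_train:]
    return train, test


# Placeholder identifiers standing in for the 140 epitope and 380 non-epitope samples.
positives = [("epitope", i) for i in range(140)]
negatives = [("non-epitope", i) for i in range(380)]

# Six independent training-test pairs, one per repetition.
splits = [balanced_split(positives, negatives, seed=s) for s in range(6)]
for train, test in splits:
    print(len(train), len(test))
```

Each repetition yields 240 training samples and 40 test samples, so a single misclassified test sample changes the error rate by 2.5 percentage points.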

As shown in Fig. 8, when the number of samples for the negative class was higher than that for the positive class, high prediction accuracy was obtained for the negative class, but the model could not recognize the positive class. As the class distribution in the training set became more balanced, the model’s ability to correctly predict the positive class increased, and the total error decreased. For this reason, the training set was created with 120 samples from each class.

The training-test set creation process was repeated 6 times with random samples. In each repetition, 20 randomly selected samples out of the 140 in the positive class were allocated for testing, and the rest were allocated for training. For the negative class, 20 randomly selected samples out of 380 were added to the test set, and 120 randomly selected samples were added to the training set. In Table 3, the individual decisions of the fuzzy methods and the prediction errors obtained by the proposed method are given for the test sets. When examined individually, the fuzzy methods are insufficient on their own to define membership functions that model the whole dataset. Combining the decisions of the fuzzy methods resulted in a significant improvement in prediction performance, because each method learns different properties of the data. The proposed ensemble fuzzy model classifies SARS-CoV data with an average accuracy of 91.7%.

The Wilcoxon rank-sum test was applied to the SARS-CoV results to measure the statistical significance of the difference between the results of the individual methods and those of the proposed method. The test results are given in the last column of Table 3 at the 0.05 significance level. ‘+’ indicates that the results of the proposed method are statistically better than those of the corresponding algorithm.
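Table 3

Classification errors for SARS-CoV.

| Method | Test1 | Test2 | Test3 | Test4 | Test5 | Test6 | Avg. | Sig |
|---|---|---|---|---|---|---|---|---|
| CHI | 17.5 | 37.5 | 25 | 27.5 | 22.5 | 20 | 29.17 | + |
| GBML | 17.5 | 15 | 32.5 | 15 | 20 | 22.5 | 23.82 | + |
| GCCL | 22.5 | 55 | 30 | 12.5 | 30 | 32.5 | 30.42 | + |
| SLAVE | 27.5 | 17.5 | 7.5 | 25 | 22.5 | 22.5 | 20.42 | + |
| W | 52.5 | 52.5 | 52.5 | 27.5 | 45 | 47.5 | 46.25 | + |
| Ensemble fuzzy | 7.5 | 12.5 | 7.5 | 10 | 7.5 | 5 | 8.33 | |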
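As a worked check of the ensemble row, the short sketch below recomputes the summary figures quoted in the text from the six per-test error rates. The variable names are illustrative; the conversion to misclassified counts assumes the 40-sample test sets (20 per class) described above.

```python
# Error rates (%) of the ensemble fuzzy model on the six test sets (Table 3).
ensemble_errors = [7.5, 12.5, 7.5, 10, 7.5, 5]
test_set_size = 40  # 20 epitope + 20 non-epitope samples per test set

mean_error = sum(ensemble_errors) / len(ensemble_errors)
mean_accuracy = 100 - mean_error
best_accuracy = 100 - min(ensemble_errors)
worst_accuracy = 100 - max(ensemble_errors)

# Number of misclassified peptides behind each error rate.
misclassified = [round(e * test_set_size / 100) for e in ensemble_errors]

print(f"mean error: {mean_error:.2f}%")
print(f"mean accuracy: {mean_accuracy:.2f}%")
print(f"accuracy range: {worst_accuracy}% to {best_accuracy}%")
print(f"misclassified per test set: {misclassified}")
```

The mean error of 8.33% corresponds to the average accuracy of about 91.7%, and the per-test accuracies span 87.5% to 95%.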

SARS-CoV-2 prediction

Considering the high genome similarity of SARS-CoV with SARS-CoV-2, epitope prediction was made for SARS-CoV-2 with fuzzy models trained for SARS-CoV, as shown in Fig. 6. While estimating with SARS-CoV data, the model was trained 6 times since there were 6 training-test sets created with randomly selected samples. Each of these models was also applied to the SARS-CoV-2 data to make predictions. The ensemble fuzzy classification method makes predictions by combining the decisions of 5 fuzzy classifiers. This process was repeated 6 times to ensure statistical validity. Therefore, 6 models were formed. With each model, the prediction was made on all SARS-CoV-2 data. Unlike the method applied for SARS-CoV, the decision of the models is combined with different degrees of sensitivity; common decision of at least 4 models (4V), common decision of at least 5 models (5V) and common decision of all models (6V). Accordingly, for a peptide in the dataset to be labeled as an epitope by the 4V method, at least 4 out of 6 models must have made an “epitope” decision for that peptide. Table 4 gives the number of peptides labeled as epitopes for each method and their lengths. Additionally, the “dataset” column shows how many peptides the data include for each length.
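Table 4

Prediction results for SARS-CoV-2.

| Epitope length | Dataset | Predicted epitopes (4V) | Predicted epitopes (5V) | Predicted epitopes (6V) |
|---|---|---|---|---|
| 5 | 1277 | 776 | 642 | 325 |
| 6 | 1276 | 201 | 136 | 70 |
| 7 | 1275 | 229 | 157 | 67 |
| 8 | 1274 | 265 | 185 | 81 |
| 9 | 1273 | 287 | 194 | 88 |
| 10 | 1272 | 294 | 196 | 82 |
| 11 | 1271 | 313 | 206 | 106 |
| 12 | 1270 | 330 | 219 | 98 |
| 13 | 1269 | 321 | 226 | 123 |
| 14 | 1268 | 321 | 232 | 121 |
| 15 | 1267 | 329 | 221 | 121 |
| 16 | 1266 | 335 | 229 | 138 |
| 17 | 1265 | 345 | 244 | 145 |
| 18 | 1264 | 369 | 273 | 144 |
| 19 | 1263 | 369 | 269 | 139 |
| 20 | 1262 | 381 | 282 | 156 |
| Total | 20 312 | 5465 | 3911 | 2004 |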
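The "Dataset" column of Table 4 follows from enumerating every contiguous window of length k over a protein of 1281 residues, which gives 1281 − k + 1 peptides per length. A minimal sketch of this enumeration is shown below; the function names are hypothetical and the short sequence is a toy string, not the spike protein.

```python
def kmers(sequence, k):
    """Return all contiguous subsequences (k-mers) of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(protein_length, k_min=5, k_max=20):
    """Number of k-mers per length for a protein of the given length."""
    return {k: protein_length - k + 1 for k in range(k_min, k_max + 1)}


# Toy sequence to show the sliding window.
toy_peptides = kmers("ACDEFGHIK", 5)
print(toy_peptides)

# Counts for the 1281-residue SARS-CoV-2 protein used in the dataset.
counts = kmer_counts(1281)
total = sum(counts.values())
print(counts[5], counts[20], total)
```

The counts run from 1277 peptides of length 5 down to 1262 peptides of length 20, for a total of 20 312, in agreement with the table.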
The SARS-CoV-2 dataset contains all possible k-mers of the spike protein that are 5–20 amino acids long. The proposed ensemble fuzzy classification model labeled 5465 peptides as epitopes with the 4V method, 3911 with the 5V method, and 2004 with the 6V method. The predicted epitopes for all three methods are listed in Appendix Table A.1.

The algorithm was executed on an Intel(R) Core(TM) i7-6700HQ CPU at 2.60 GHz with a 64-bit architecture and 16 GB RAM, running Windows 10, and was implemented in the R programming language using the frbs package. The execution times are given in Table 5. As mentioned earlier, the SARS-CoV dataset was divided into 6 training sets by random sampling. Each row of the table reports the execution times of all methods for the corresponding training set. A separate model was created for each fuzzy method in each training set, and all models obtained with a training set were then applied to the SARS-CoV-2 data as the test set. For this reason, model creation and prediction times are given separately for each training set. The ‘Prediction’ column is the time required for the models trained with the relevant training set to make predictions on the SARS-CoV-2 data. The last column is the total time required to train the models with the relevant training set and to make predictions. The time required to train all models and make predictions for all training sets is given in the last row. The final decision for the SARS-CoV-2 data was obtained by combining the predictions of all models from all training sets. Since GCCL, GBML, and SLAVE are genetic algorithm-based iterative methods, their run times are greater than those of CHI and W.
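Table 5

CPU time for SARS-CoV-2 prediction.

| TrainSet | GCCL (min) | W (s) | CHI (s) | GBML (min) | SLAVE (min) | Prediction (min) | Total (min) |
|---|---|---|---|---|---|---|---|
| Train1 | 1.73 | 0.07 | 0.03 | 4.25 | 4.57 | 9.55 | 20.1 |
| Train2 | 1.78 | 0.06 | 0.03 | 4.33 | 4.61 | 9.08 | 19.8 |
| Train3 | 1.44 | 0.06 | 0.04 | 4.31 | 4.34 | 8.92 | 19.33 |
| Train4 | 1.80 | 0.06 | 0.03 | 4.28 | 4.54 | 9.07 | 19.69 |
| Train5 | 1.86 | 0.06 | 0.03 | 4.35 | 4.46 | 9.01 | 19.68 |
| Train6 | 1.84 | 0.06 | 0.03 | 4.27 | 4.49 | 9.07 | 19.67 |
| CPU time of whole framework | | | | | | | 118.27 |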
Since the SARS-CoV-2 data are unlabeled, the selected epitopes were compared with the epitopes that the BepiPred server found for the same protein. In addition, these results were compared with epitopes reported in the literature by studies using various bioinformatics tools or in vitro methods. The BepiPred server identified 44 peptides, 1–36 amino acids long, for the spike protein. For the same protein sequence, 34 linear B-cell epitopes were identified as vaccine candidates in [33], 29 in [17], 63 in [39], 4 in [42], 21 in [43], 5 in [44], 17 in [46], 46 in [47], and 10 in [45].

The peptides in the SARS-CoV-2 dataset are 5–20 amino acids long. Epitopes from BepiPred and the literature that are shorter than 5 or longer than 20 amino acids were therefore not included in the comparison. In addition, some peptides identified in these studies were excluded because they do not occur in the SARS-CoV-2 dataset used. After this elimination, comparisons were made for the different sensitivity levels (4V, 5V, 6V) of the proposed method. Comparative results for BepiPred are given in Table 6, and comparative results for the literature are given in Table 7. The first column in the tables lists the peptides predicted by BepiPred or reported in the literature. The next columns give the sequence lengths of those peptides (Len) and their antigenicity scores (Ant) measured by VaxiJen 2.0. The detection results, peptide lengths, and antigenicity scores of the proposed method for the different sensitivity levels are given in the remaining columns. The Detection column (Det) indicates whether a peptide is found by the proposed method. A peptide identified by the related studies is also marked “✓” if it is a subsequence of a peptide found by the ensemble fuzzy method. Peptides with high antigenicity scores are written in bold. The mean antigenicity values of the peptides found by the methods are given in the “Average” row. If the antigenicity score of a peptide could not be measured with the VaxiJen tool, it is indicated as “NA”.
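The detection rule used in these comparisons (a reference peptide counts as found if it is contained in a predicted peptide) can be sketched as below. Several points are assumptions made for illustration: "subsequence" is read as a contiguous substring, the shortest containing peptide is reported when there are several, the function name is hypothetical, and the list of predicted peptides is invented for the example rather than taken from Appendix Table A.1.

```python
def find_containing(reference_peptide, predicted_peptides):
    """Return the shortest predicted peptide that contains the reference
    peptide as a contiguous substring, or None if there is none."""
    matches = [p for p in predicted_peptides if reference_peptide in p]
    return min(matches, key=len) if matches else None


# Hypothetical predictions of the ensemble model at one sensitivity level.
predicted = ["GQSKRVDFCGK", "SNKKFLPF", "QSKRVDFCGKGYHL"]

for reference in ["GQSKRVDFC", "SNKKFLPF", "NCTEV"]:
    hit = find_containing(reference, predicted)
    if hit is None:
        print(f"{reference}: not detected")
    else:
        print(f"{reference}: detected in {hit} (Len {len(hit)})")
```

The length of the matching predicted peptide is what would appear in the Len column of the corresponding method, which is why it can differ from the length of the reference peptide.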
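Table 6

Comparison with BepiPred results.

| BepiPred epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VNLTTRTQLPPAYTNSFTR | 19 | 0.6285 | ✓ | 19 | 0.6285 | ✓ | 19 | 0.6285 | | | |
| ASTEKS | 6 | 0.6206 | ✓ | 7 | 0.8587 | ✓ | 7 | 0.8587 | ✓ | 7 | 0.8587 |
| PFLGVYYHKNNKSWMESE | 18 | 0.5664 | ✓ | 18 | 0.5664 | ✓ | 18 | 0.5664 | ✓ | 18 | 0.5664 |
| KHTPINLVRDLPQGFSA | 17 | 0.6207 | ✓ | 19 | 0.5535 | | | | | | |
| TPGDSSSGWTA | 11 | 0.2473 | ✓ | 12 | 0.0746 | ✓ | 17 | 0.4892 | ✓ | 17 | 0.4892 |
| IYQTSNFRVQP | 11 | 1.0147 | ✓ | 12 | 0.9986 | ✓ | 12 | 0.9986 | ✓ | 16 | 0.8559 |
| DEVRQIAPGQTGKIAD | 16 | 1.0388 | ✓ | 16 | 1.0388 | ✓ | 16 | 1.0388 | ✓ | 19 | 1.1515 |
| NNLDSKVGGNYN | 12 | 0.7538 | ✓ | 15 | 0.7275 | ✓ | 15 | 0.7275 | ✓ | 15 | 0.7275 |
| GFNCYFPLQSYGF | 13 | 0.8519 | ✓ | 18 | 0.8567 | ✓ | 18 | 0.8567 | ✓ | 18 | 0.8567 |
| SNKKFLPF | 8 | 1.3952 | ✓ | 8 | 1.3952 | ✓ | 8 | 1.3952 | ✓ | 9 | 1.1432 |
| NCTEV | 5 | NA | ✓ | 5 | NA | ✓ | 5 | NA | | | |
| HADQLTPT | 8 | 0.4177 | ✓ | 8 | 0.4177 | ✓ | 8 | 0.4177 | ✓ | 16 | 0.6093 |
| RVYSTGSNVFQ | 11 | −0.1000 | ✓ | 13 | 0.3359 | ✓ | 13 | 0.3359 | ✓ | 14 | 0.1826 |
| AYTMSLGAENSVAYSNN | 17 | 0.5966 | ✓ | 17 | 0.5966 | ✓ | 17 | 0.5966 | ✓ | 17 | 0.5966 |
| KQIYKTPPIKDFGGF | 15 | −0.3896 | ✓ | 15 | −0.3896 | ✓ | 15 | −0.3896 | ✓ | 15 | −0.3896 |
| LPDPSKPSKR | 10 | 0.2641 | ✓ | 10 | 0.2641 | ✓ | 10 | 0.2641 | ✓ | 10 | 0.2641 |
| DPPEAEVQI | 9 | 0.5966 | ✓ | 10 | 0.4955 | ✓ | 10 | 0.4955 | ✓ | 11 | −0.0004 |
| GQSKRVDFC | 9 | 1.7790 | ✓ | 11 | 1.4088 | ✓ | 12 | 1.3607 | ✓ | 12 | 1.3607 |
| FYEPQIITTD | 10 | 0.4179 | ✓ | 10 | 0.4179 | ✓ | 16 | 0.6504 | ✓ | 19 | 0.2751 |
| VNNTVYDPLQPELDSF | 16 | 0.2201 | ✓ | 16 | 0.2201 | ✓ | 16 | 0.2201 | ✓ | 19 | 0.1493 |
| LGKYEQYIKGSGR | 13 | 0.3101 | ✓ | 13 | 0.3101 | ✓ | 13 | 0.3101 | ✓ | 13 | 0.3101 |
| Average | | 0.5925 | | | 0.5900 | | | 0.6253 | | | 0.5553 |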
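Table 7

Comparison results with literature.

Predicted epitopes in [17]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FHAIHVSGTNG | 11 | 0.8882 | ✓ | 18 | 0.6317 | ✓ | 18 | 0.6317 | ✓ | 18 | 0.6317 |
| TLDSKTQSLLIVNNATNV | 18 | 0.7295 | | | | | | | | | |
| PGDSSSGWTAGA | 12 | 0.0820 | ✓ | 17 | 0.2386 | ✓ | 17 | 0.2386 | ✓ | 18 | 0.5144 |
| NENGTITDA | 9 | 0.5257 | ✓ | 9 | 0.5257 | ✓ | 10 | 0.6020 | ✓ | 11 | 0.7882 |
| IYQTSNFRV | 9 | 0.3109 | ✓ | 11 | 0.2839 | ✓ | 11 | 0.2839 | ✓ | 16 | 0.8559 |
| IAWNSNNLDSK | 11 | 1.2444 | ✓ | 11 | 1.2444 | ✓ | 12 | 0.9178 | ✓ | 13 | 0.7773 |
| STEIYQAGSTPCNGV | 15 | −0.0513 | ✓ | 16 | 0.0539 | ✓ | 18 | −0.0751 | ✓ | 19 | −0.0745 |
| RVYSTGSNVFQTRA | 14 | 0.3248 | ✓ | 14 | 0.3248 | ✓ | 14 | 0.3248 | ✓ | 18 | 0.5620 |
| GAEHVNNSYE | 10 | 0.8739 | ✓ | 10 | 0.8739 | ✓ | 10 | 0.8739 | ✓ | 10 | 0.8739 |
| YICGDSTECSNLLLQ | 15 | −0.0093 | | | | | | | | | |
| GSFCTQLNRALTG | 13 | 0.4763 | | | | | | | | | |
| AVEQDKNTQE | 10 | 0.2792 | ✓ | 12 | 0.5008 | ✓ | 12 | 0.5008 | | | |
| DEMIAQYTSALLAG | 14 | 0.1366 | | | | | | | | | |
| LQSLQTYVT | 9 | −0.0592 | | | | | | | | | |
| RASANLAATKMSECVLGQ | 18 | 0.4001 | | | | | | | | | |
| TDNTFVSGNCD | 11 | 0.0820 | ✓ | 14 | 0.1793 | ✓ | 14 | 0.1793 | ✓ | 14 | 0.1793 |
| KNHTSPDV | 8 | 0.9006 | ✓ | 8 | 0.9006 | ✓ | 8 | 0.9006 | ✓ | 8 | 0.9006 |
| GINASVVNIQ | 10 | 1.0425 | | | | | | | | | |
| EVAKNLNESL | 10 | −0.0432 | ✓ | 10 | −0.0432 | ✓ | 14 | 0.1512 | | | |
| Average | | 0.4514 | | | 0.4762 | | | 0.4608 | | | 0.6009 |

Predicted epitopes in [39]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EVRQIAPGQTGKIADY | 16 | 1.3837 | ✓ | 16 | 1.3837 | ✓ | 17 | 1.0936 | ✓ | 19 | 1.1515 |
| TVEKGIYQTSNFRVQP | 16 | 0.6733 | ✓ | 16 | 0.6733 | ✓ | 16 | 0.6733 | | | |
| HRSYLTPGDSSSGWTA | 16 | 0.6017 | ✓ | 16 | 0.6017 | ✓ | 17 | 0.4892 | ✓ | 17 | 0.4892 |
| YVGYLQPRTFLLKYNE | 16 | 0.5108 | ✓ | 18 | 0.4816 | ✓ | 18 | 0.4816 | ✓ | 18 | 0.4816 |
| CGPKKSTNLVKNKCVN | 16 | 0.2006 | ✓ | 20 | 0.8935 | ✓ | 20 | 0.8935 | | | |
| TKTSVDCTMYICGDST | 16 | 0.0937 | ✓ | 18 | 0.1426 | ✓ | 18 | 0.1426 | | | |
| TEIYQAGSTPCNGVEG | 16 | −0.0105 | ✓ | 16 | −0.0105 | ✓ | 16 | −0.0105 | ✓ | 18 | 0.0583 |
| FERDISTEIYQAGSTP | 16 | −0.2904 | ✓ | 17 | −0.1383 | ✓ | 17 | −0.1383 | ✓ | 19 | −0.0782 |
| FAMQMAYRFNGIGVTQ | 16 | 1.3096 | ✓ | 18 | 1.4137 | | | | | | |
| IGKIQDSLSSTASALG | 16 | 0.654 | ✓ | 19 | 0.5712 | ✓ | 19 | 0.5712 | ✓ | 20 | 0.4992 |
| LQSYGFQPTNGVGYQP | 16 | 0.5258 | ✓ | 17 | 0.4203 | ✓ | 17 | 0.4203 | | | |
| SWMESEFRVYSSANNC | 16 | 0.1724 | ✓ | 16 | 0.1724 | ✓ | 16 | 0.1724 | ✓ | 16 | 0.1724 |
| TRFQTLLALHRSYLTP | 16 | 0.5115 | ✓ | 18 | 0.5595 | | | | | | |
| PQIITTDNTFVSGNCD | 16 | 0.2404 | ✓ | 16 | 0.2404 | ✓ | 16 | 0.2404 | ✓ | 16 | 0.2404 |
| QKEIDRLNEVAKNLNE | 16 | 0.0684 | ✓ | 18 | 0.1255 | ✓ | 18 | 0.1255 | | | |
| KQIYKTPPIKDFGGFN | 16 | −0.2241 | ✓ | 16 | −0.2241 | ✓ | 16 | −0.2241 | ✓ | 16 | −0.2241 |
| SKRVDFCGK | 9 | 1.7321 | ✓ | 9 | 1.7321 | ✓ | 12 | 1.3607 | ✓ | 12 | 1.3607 |
| GKYEQY | 6 | 1.2821 | ✓ | 6 | 1.2821 | ✓ | 6 | 1.2821 | ✓ | 6 | 1.2821 |
| LDSKVGGNYNYLY | 13 | 0.8331 | ✓ | 14 | 0.8329 | ✓ | 14 | 0.8329 | ✓ | 14 | 0.8329 |
| TPGDSSSGWTAGA | 13 | 0.1212 | ✓ | 18 | 0.5144 | ✓ | 18 | 0.5144 | ✓ | 18 | 0.5144 |
| FLPFQ | 5 | NA | ✓ | 8 | 1.4427 | ✓ | 8 | 1.4427 | ✓ | 9 | 1.1432 |
| TSNFRVQPTE | 10 | 1.3571 | ✓ | 11 | 1.2323 | ✓ | 11 | 1.2323 | ✓ | 11 | 1.2323 |
| TNLCPF | 6 | 1.2508 | ✓ | 8 | 0.8906 | ✓ | 13 | 1.04 | | | |
| DPSKPSKRSF | 10 | 0.8148 | ✓ | 10 | 0.8148 | ✓ | 10 | 0.8148 | ✓ | 11 | 0.6286 |
| EVFNATRFASVYAWNRKRI | 19 | 0.2655 | ✓ | 19 | 0.2655 | | | | | | |
| AEVQIDR | 7 | −0.4355 | ✓ | 8 | −0.2814 | ✓ | 11 | −0.0004 | ✓ | 11 | −0.0004 |
| PTNGVG | 6 | −1.1441 | ✓ | 7 | −0.7278 | ✓ | 8 | −0.3112 | ✓ | 8 | −0.3112 |
| QLTPTWRVYSTGSNVFQTRA | 20 | 0.7725 | ✓ | 20 | 0.7725 | ✓ | 20 | 0.7725 | | | |
| TMSLGAENSVAYSNNS | 16 | 0.6687 | ✓ | 16 | 0.6687 | ✓ | 16 | 0.6687 | ✓ | 16 | 0.6687 |
| GFNCYFPLQSY | 11 | 0.9224 | ✓ | 18 | 0.8567 | ✓ | 18 | 0.8567 | ✓ | 18 | 0.8567 |
| EPQIITTDNT | 10 | 0.7545 | ✓ | 13 | 0.6684 | ✓ | 16 | 0.5227 | ✓ | 17 | 0.3342 |
| NSYECDIPIG | 10 | 0.6533 | ✓ | 11 | 0.8366 | ✓ | 11 | 0.8366 | ✓ | 14 | 0.9296 |
| IYKTPPIKDFGGFNF | 15 | 0.0696 | ✓ | 15 | 0.0696 | ✓ | 15 | 0.0696 | ✓ | 15 | 0.0696 |
| Average | | 0.5106 | | | 0.5763 | | | 0.6114 | | | 0.5362 |

Predicted epitopes in [44]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LTPGDSSSGWTAG | 13 | 0.4950 | ✓ | 18 | 0.3768 | ✓ | 18 | 0.3768 | ✓ | 18 | 0.3768 |
| VRQIAPGQTGKIAD | 14 | 1.2606 | ✓ | 15 | 1.3487 | ✓ | 16 | 1.0388 | ✓ | 19 | 1.1515 |
| YQAGSTPCNGV | 11 | 0.0881 | ✓ | 13 | 0.1909 | ✓ | 15 | 0.2479 | ✓ | 15 | 0.2479 |
| ILPDPSKPSKRS | 12 | 0.5322 | ✓ | 12 | 0.5322 | ✓ | 12 | 0.5322 | ✓ | 12 | 0.5322 |
| Average | | 0.594 | | | 0.6121 | | | 0.5489 | | | 0.5771 |

Predicted epitopes in [43]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DVVNQNAQALNTLVKQL | 17 | 0.0320 | | | | | | | | | |
| EAEVQIDRLITGRLQSL | 17 | −0.1784 | ✓ | 20 | −0.0881 | | | | | | |
| GAGICASY | 8 | 0.5210 | ✓ | 13 | 0.6871 | ✓ | 13 | 0.6871 | ✓ | 17 | 0.4587 |
| GSFCTQLN | 8 | 0.8144 | ✓ | 9 | 0.9306 | ✓ | 9 | 0.9306 | | | |
| KGIYQTSN | 8 | 0.2441 | ✓ | 8 | 0.2441 | ✓ | 9 | 0.4627 | ✓ | 10 | 0.3992 |
| AMQMAYRF | 8 | 0.9776 | ✓ | 9 | 1.0278 | ✓ | 11 | 1.2909 | ✓ | 11 | 1.2909 |
| KNHTSPDVDLGDISGIN | 17 | 1.1116 | ✓ | 18 | 1.0631 | ✓ | 19 | 0.8800 | ✓ | 19 | 0.8800 |
| AATKMSECVLGQSKRVD | 17 | 0.6159 | | | | | | | | | |
| PFAMQMAYRFNGIGVTQ | 17 | 1.3306 | ✓ | 18 | 1.4137 | | | | | | |
| QALNTLVKQLSSNFGAI | 17 | 0.0872 | | | | | | | | | |
| QLIRAAEIRASANLAAT | 17 | 0.3714 | ✓ | 19 | 0.3381 | | | | | | |
| QQFGRD | 6 | −0.5500 | ✓ | 6 | −0.5500 | ✓ | 6 | −0.5500 | ✓ | 6 | −0.5500 |
| RASANLAATKMSECVLG | 17 | 0.4414 | | | | | | | | | |
| RLITGRLQSLQTYVTQQ | 17 | −0.2774 | | | | | | | | | |
| SLQTYVTQQLIRAAEIR | 17 | −0.0120 | | | | | | | | | |
| Average | | 0.5158 | | | 0.5629 | | | 0.6169 | | | 0.4958 |

Predicted epitopes in [33]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DPFLGVYYHKNNKSWME | 17 | 0.5821 | ✓ | 17 | 0.5821 | ✓ | 17 | 0.5821 | ✓ | 17 | 0.5821 |
| MDLEGKQGNFKNL | 13 | 1.2592 | ✓ | 13 | 1.2592 | ✓ | 13 | 1.2592 | | | |
| KHTPINLVRDLPQGFS | 16 | 0.6403 | ✓ | 17 | 0.5695 | ✓ | 17 | 0.5695 | | | |
| TPGDSSSGWTA | 11 | 0.2473 | ✓ | 12 | 0.0746 | ✓ | 17 | 0.4892 | ✓ | 17 | 0.4892 |
| KSFTVEKGIYQTSNFRVQP | 19 | 0.5729 | ✓ | 19 | 0.5729 | ✓ | 19 | 0.5729 | | | |
| SNKKFLPF | 8 | 1.3952 | ✓ | 8 | 1.3952 | ✓ | 8 | 1.3952 | ✓ | 9 | 1.1432 |
| TNTSN | 5 | NA | ✓ | 5 | | ✓ | 5 | | ✓ | 5 | |
| NCTEVPVAIHADQLTPT | 17 | 0.3987 | | | | | | | | | |
| RVYSTGSNVFQ | 11 | −0.1000 | ✓ | 13 | 0.3359 | ✓ | 13 | 0.3359 | | | |
| VNNSYECDIPI | 11 | 0.6124 | ✓ | 16 | 0.9123 | ✓ | 16 | 0.9123 | | | |
| YTMSLGAENSVAYSNN | 16 | 0.6434 | ✓ | 16 | 0.6434 | ✓ | 16 | 0.6434 | ✓ | 16 | 0.6434 |
| EQDKNTQ | 7 | 0.1017 | ✓ | 7 | 0.1017 | ✓ | 7 | 0.1017 | ✓ | 7 | 0.1017 |
| KQIYKTPPIKDFGGF | 15 | −0.3896 | ✓ | 15 | −0.3896 | ✓ | 15 | −0.3896 | ✓ | 15 | −0.3896 |
| PDPSKPSK | 8 | 0.0621 | ✓ | 8 | 0.0621 | ✓ | 8 | 0.0621 | ✓ | 8 | 0.0621 |
| LADAGFIKQYGDCLG | 15 | 0.2071 | | | | | | | | | |
| EAEVQ | 5 | NA | ✓ | 5 | NA | ✓ | 5 | NA | ✓ | 11 | −0.0004 |
| GQSKRVDFC | 9 | 1.7790 | ✓ | 11 | 1.4088 | ✓ | 12 | 1.3607 | ✓ | 12 | 1.3607 |
| RNFYEPQIITTD | 12 | 0.3529 | ✓ | 15 | 0.6381 | ✓ | 16 | 0.6504 | ✓ | 20 | 0.2624 |
| Average | | 0.5228 | | | 0.5833 | | | 0.6103 | | | 0.4255 |

Predicted epitopes in [47]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RGVYYPDK | 8 | 1.0191 | ✓ | 8 | 1.0191 | ✓ | 8 | 1.0191 | ✓ | 11 | 0.5200 |
| RSSVLHST | 8 | 0.5459 | ✓ | 10 | 0.5404 | ✓ | 10 | 0.5404 | | | |
| DLFLPFFS | 8 | −0.3099 | | | | | | | | | |
| FHAIHV | 6 | 1.6766 | ✓ | 18 | 0.6317 | ✓ | 18 | 0.6317 | ✓ | 18 | 0.6317 |
| NPVLPFN | 7 | 0.5863 | ✓ | 9 | 0.0146 | ✓ | 9 | 0.0146 | ✓ | 9 | 0.0146 |
| QSLLIVN | 7 | 0.8168 | ✓ | 15 | 0.5156 | ✓ | 15 | 0.5156 | | | |
| NVVIKVCEFQ | 10 | −0.1498 | | | | | | | | | |
| CNDPFLGVYYH | 11 | 0.4109 | ✓ | 17 | 0.5314 | ✓ | 17 | 0.5314 | ✓ | 17 | 0.5314 |
| FEYVSQP | 7 | 0.9073 | ✓ | 11 | 0.1016 | ✓ | 11 | 0.1016 | | | |
| INLVRDL | 7 | −0.3198 | ✓ | 14 | 0.4022 | ✓ | 14 | 0.4022 | ✓ | 17 | 0.4924 |
| LEPLVDLP | 8 | −0.3271 | | | | | | | | | |
| QTLLALHRSY | 10 | 0.5596 | ✓ | 17 | 0.5921 | | | | | | |
| AAYYVGYL | 8 | 0.5218 | ✓ | 12 | 0.9255 | | | | | | |
| PRTFLLK | 7 | −1.3917 | ✓ | 10 | −0.2800 | ✓ | 10 | −0.2800 | ✓ | 12 | −0.2227 |
| AVDCALDP | 8 | 0.7730 | ✓ | 16 | 0.5804 | ✓ | 16 | 0.5804 | ✓ | 16 | 0.5804 |
| TNLCPFG | 7 | 1.1812 | ✓ | 8 | 0.8906 | ✓ | 13 | 1.0400 | | | |
| SNCVADYSVLYNS | 13 | −0.1828 | ✓ | 13 | 0.0152 | | | | | | |
| TFKCYGVSPT | 10 | 1.5059 | ✓ | 20 | 0.8913 | ✓ | 20 | 0.8913 | | | |
| TGCVIA | 6 | 0.4716 | ✓ | 10 | 0.0996 | ✓ | 13 | −0.1592 | ✓ | 14 | −0.1234 |
| CYFPLQSY | 8 | 0.9394 | ✓ | 8 | 0.9394 | ✓ | 12 | 0.8719 | ✓ | 12 | 0.8719 |
| FGGVSVIT | 8 | 0.7715 | ✓ | 12 | 0.4578 | ✓ | 13 | 0.4931 | ✓ | 13 | 0.4931 |
| CTEVPVAIHAD | 11 | 0.0499 | | | | | | | | | |
| AGCLIGA | 7 | 0.1743 | | | | | | | | | |
| GAGICASY | 8 | 0.5210 | ✓ | 13 | 0.6871 | ✓ | 13 | 0.6871 | ✓ | 17 | 0.4587 |
| VASQSII | 7 | −0.0188 | ✓ | 16 | 0.3257 | ✓ | 16 | 0.3257 | ✓ | 18 | 0.4018 |
| TTEILPVS | 8 | 1.2071 | | | | | | | | | |
| SVDCTMY | 7 | 1.0932 | ✓ | 17 | −0.0258 | ✓ | 18 | 0.1426 | | | |
| SNLLLQYGSFCTQL | 14 | 0.7599 | | | | | | | | | |
| VFAQVKQI | 8 | 0.5854 | ✓ | 14 | 0.3493 | ✓ | 15 | 0.4451 | | | |
| SQILPD | 6 | −0.1542 | ✓ | 8 | 0.0383 | ✓ | 8 | 0.0383 | ✓ | 11 | 0.5569 |
| YGDCLGD | 7 | −0.5555 | ✓ | 12 | 0.5494 | ✓ | 14 | 0.0416 | | | |
| RDLICAQ | 7 | 1.1443 | | | | | | | | | |
| LTVLPPL | 7 | 0.6786 | | | | | | | | | |
| YTSALLAG | 8 | 0.3798 | ✓ | 20 | 0.3640 | | | | | | |
| LNTLVKQL | 8 | −0.7591 | ✓ | 16 | −0.0646 | ✓ | 16 | −0.0646 | | | |
| ISSVLND | 7 | 0.0414 | ✓ | 11 | 0.7339 | ✓ | 12 | 0.6035 | | | |
| SLQTYVTQQ | 9 | −0.0089 | | | | | | | | | |
| SECVLGQS | 8 | −0.0110 | ✓ | 13 | 0.5417 | | | | | | |
| PHGVVFLHVTYVPA | 14 | 0.8058 | | | | | | | | | |
| PAICHDG | 7 | −1.0100 | ✓ | 15 | 0.2145 | ✓ | 15 | 0.2145 | | | |
| SGNCDVVIGI | 10 | 0.7421 | | | | | | | | | |
| ASVVNI | 6 | 0.8671 | ✓ | 13 | 0.1922 | | | | | | |
| Average | | 0.3938 | | | 0.4258 | | | 0.4011 | | | 0.4005 |

Predicted epitopes in [45]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VRQIAPGQTGKIAD | 14 | 1.2606 | ✓ | 15 | 1.3487 | ✓ | 16 | 1.0388 | ✓ | 19 | 1.1515 |
| VLGQSKRVDFCGKG | 14 | 1.3582 | | | | | | | | | |
| GLTGTGVLTESNKK | 14 | 1.0227 | ✓ | 14 | 1.0227 | ✓ | 14 | 1.0227 | ✓ | 16 | 0.6686 |
| KIADYNYKLPDDFT | 14 | 0.9567 | ✓ | 14 | 0.9567 | ✓ | 14 | 0.9567 | ✓ | 14 | 0.9567 |
| Average | | 1.1495 | | | 1.1094 | | | 1.006 | | | 0.9256 |

Predicted epitopes in [46]

| Epitope | Len | Ant | 4V Det | 4V Len | 4V Ant | 5V Det | 5V Len | 5V Ant | 6V Det | 6V Len | 6V Ant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DPFLGVYYHKNNKSWME | 17 | 0.5821 | ✓ | 17 | 0.5821 | ✓ | 17 | 0.5821 | ✓ | 17 | 0.5821 |
| MDLEGKQGNFKNL | 13 | 1.2592 | ✓ | 13 | 1.2592 | ✓ | 13 | 1.2592 | | | |
| KHTPINLVRDLPQGFS | 16 | 0.6403 | ✓ | 17 | 0.5695 | ✓ | 17 | 0.5695 | | | |
| TPGDSSSGWTA | 11 | 0.2473 | ✓ | 12 | 0.0746 | ✓ | 17 | 0.4892 | ✓ | 17 | 0.4892 |
| KSFTVEKGIYQTSNFRVQP | 19 | 0.5729 | ✓ | 19 | 0.5729 | ✓ | 19 | 0.5729 | | | |
| VNNSYECDIPI | 11 | 0.6124 | ✓ | 16 | 0.9123 | ✓ | 16 | 0.9123 | | | |
| YTMSLGAENSVAYSNN | 16 | 0.6434 | ✓ | 16 | 0.6434 | ✓ | 16 | 0.6434 | ✓ | 16 | 0.6434 |
| Average | | 0.6111 | | | 0.6591 | | | 0.7184 | | | 0.5716 |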
As seen from Table 6, 21 of the sequences identified by BepiPred were found by the 4V method, 20 by the 5V method, and 18 by the 6V method. When peptides with different sequence lengths were compared according to their antigenicity scores, 6 peptides found by the ensemble fuzzy method had higher antigenicity scores, while BepiPred was more successful for 5 peptides. Looking at the average scores, the 5V method gives the best result.

In Table 7, epitopes from other studies in the literature are compared with the proposed method. Grifoni et al. [17] identified 19 peptides as vaccine candidates. Of these, 12 were also predicted by the 4V and 5V methods, and 10 by the 6V method. Considering all 3 methods, 8 of the predicted peptides appeared to be more antigenic. According to the mean antigenicity score, the average of all peptides found by the 6V method was the highest. All 33 peptides identified in [39] were also predicted by the 4V method, and the 5V and 6V methods selected 30 and 22 of them, respectively. Of the peptides predicted by the proposed method, 13 have higher antigenicity scores. The highest mean antigenicity score was obtained with the 5V method. In [44], the authors identified 4 peptides, all of which were selected by the ensemble fuzzy method, and the developed method found more antigenic epitopes for 2 of them. Of the 16 epitopes identified in [43], 9, 6, and 5 were labeled by the 4V, 5V, and 6V methods, respectively. Except for 2, the developed method selected more antigenic epitopes. In terms of the mean antigenicity score, the 5V method was the most successful. Eighteen B-cell vaccine candidates were identified in [33], of which 16 were selected by the 4V and 5V methods and 11 by the 6V method. Considering the antigenicity of peptides of different lengths, the developed method found more antigenic epitopes for 4 of these vaccine candidates. Of the 42 B-cell epitopes identified by Rehman et al. [47], 34 were detected by the 4V method, 24 by the 5V method, and 13 by the 6V method. The antigenicity scores of 15 of the epitopes found in different lengths by the ensemble fuzzy method were higher. Lin et al. [45] identified 4 epitopes in their study, 3 of which were found by the fuzzy method at all sensitivity levels. However, the epitopes found in [45] had higher average antigenicity scores. All 7 peptides presented in [46] were predicted by the 4V and 5V methods, whereas the 6V method detected only 3 of them. Considering the antigenicity scores of epitopes of different lengths, the 5V method detected more antigenic epitopes.

When the results of the methods presented in the literature are compared with those of the developed method, it is seen that most of the epitopes suggested as vaccine candidates for SARS-CoV-2 by many different methods can also be detected by the ensemble fuzzy method. This indicates that the proposed ensemble fuzzy method is robust. Among the 4V, 5V, and 6V methods, which combine the model decisions with different vote thresholds, the 5V method was generally the most successful in terms of the average antigenicity score.

Discussion

SARS-CoV-2 B-cell epitope identification with the aid of a high-performance prediction method contributes to rapid, reliable, and effective protein-based vaccine development. The use of experimental methods in vaccine development is quite time-consuming, costly, and labor-intensive. Therefore, the main aim of this study was to propose a method that can predict B-cell epitopes with high accuracy. The results obtained from the study show that we have achieved this goal. For SARS-CoV B-cell epitope prediction, the minimal error rate of the five individual fuzzy learning classification methods across the test sets was 7.5% (SLAVE) and the maximal error rate was 52.5% (W), while the minimal error rate of the ensemble fuzzy method was 5% and its maximal error rate was 12.5%. When the average errors of the test results were compared, the proposed method had the lowest error rate of 8.33%, followed by the SLAVE method at 20.42%. The mean error of the most successful individual fuzzy learning model in SARS-CoV B-cell epitope prediction was approximately 2.5 times higher than that of the proposed model. These results show that the proposed method outperformed the other methods and that, for this task, ensemble learning is more successful than the individual methods. The main advantage of the proposed method is that the decisions made by the fuzzy methods are combined in an ensemble-based structure. The fuzzy methods used include different learning and decision-making approaches, as explained in Section 3.2. The majority voting and decision aggregation phase showed that different fuzzy methods learn different features of the dataset. Thus, the average spread of a model that contributes to the ensemble is reduced, and the average prediction performance for each model is improved. In fact, as clearly shown in Table 3, the aggregation of decisions improved the estimation performance of the single fuzzy classifiers.
The prediction accuracies of the studies on B-cell epitope prediction in the literature were compared with the prediction accuracies obtained from this study. In [54], different machine learning methods were compared and it was reported that the most successful method was the ensemble method with an accuracy value of 87.8%. In [51], Bayesian neural networks with drop-weight models were proposed for epitope prediction. The prediction accuracy of the proposed model was 85%. In [53], the attentional mechanism LSTM network approach was used for epitope prediction. With this model, epitopes were predicted with 79% accuracy. In this study, the proposed ensemble classifier outperformed other studies in the literature, with a minimum accuracy of 87.5% and maximum accuracy of 95.0% epitope prediction. In this study, we determined that the dataset was imbalanced; therefore, we applied the subsampling preprocessing step. Thus, we randomly generated 6 different sub-datasets with the positive and negative class labels of the samples balanced. The SARS-CoV B-cell epitope prediction accuracies of the proposed ensemble fuzzy model are illustrated in Fig. 9.
Fig. 9

Confusion matrices of the proposed ensemble fuzzy model for different test sets.

The dataset used in the study consists of all possible subsequences of the SARS-CoV-2 spike protein of different lengths, and there is no label information to definitively determine whether a sequence is an epitope or not. Fuzzy logic approaches make it possible to take imprecise information into account while making a decision. As the predictions indicate, the fuzzy approach can measure more sensitively than classical logic-based classification methods. The results obtained showed that it was more successful than the other machine learning methods evaluated in the literature. In addition, the SARS-CoV-2 B-cell epitope prediction results obtained in this study were validated against the results reported in the literature and the epitopes predicted by the BepiPred server. Furthermore, antigenicity scores of the SARS-CoV-2 B-cell epitopes were measured. Thus, we hope that the information obtained from this study will help develop an effective protein-based vaccine against SARS-CoV-2.

Conclusion

In this study, an efficient fuzzy learning-based model is proposed that predicts potential epitopes to assist the first stage of mRNA-based vaccine development. The SARS-CoV B-cell epitope prediction performances of five different fuzzy learning methods were examined and compared with the proposed method. According to the results obtained, the proposed ensemble fuzzy method outperformed the other methods, with a maximum accuracy of 95% and an average accuracy of 91.7%. The proposed method was then used to predict B-cell epitopes in unlabeled SARS-CoV-2 data, and the epitopes found were checked against those reported in the literature and the predictions of the BepiPred server. Moreover, antigenicity scores of the identified epitope sequences were measured using the VaxiJen server. The virus is still spreading rapidly and therefore mutating. It has been determined that some mutations can escape vaccines [15], [62]. The fact that the virus mutates and may require the redevelopment of vaccines increases the importance of using in silico methods. However, if these mutations occur outside of the identified epitope regions, the results will not be affected. Therefore, the identified epitopes can be used as potential antigens for more detailed in vitro assays to evaluate vaccine efficacy. It is anticipated that the information obtained from this study will contribute to the development of vaccines against different epidemics that may occur in the future, especially SARS-CoV-2 and its possible mutations.

Code availability statement

The source code is available at https://github.com/ZBaOz/Epitope-Identification.

CRediT authorship contribution statement

Zeynep Banu Ozger: Conceptualization, Writing – original draft, Methodology, Validation, Software, Visualization, Writing – review & editing. Pınar Cihan: Writing – original draft, Methodology, Validation, Software, Visualization, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.