
iKcr_CNN: A novel computational tool for imbalance classification of human nonhistone crotonylation sites based on convolutional neural networks with focal loss.

Lijun Dou1,2,3, Zilong Zhang2, Lei Xu4, Quan Zou1,2.   

Abstract

Lysine crotonylation (Kcr) is a newly discovered protein post-translational modification that has been shown to be widely involved in various biological processes and human diseases. Accurate and fast identification of this modification is therefore the preliminary task in investigating its biological functions. Because traditional high-throughput experimental techniques are slow, costly and labour-intensive, constructing bioinformatics predictors based on machine-learning algorithms has become the most popular alternative. Although dozens of predictors have been reported to identify Kcr sites, only two, nhKcr and DeepKcrot, focus on human nonhistone protein sequences. Moreover, owing to the imbalanced nature of the data distribution, their detection performance is severely biased towards the majority negative class and leaves much room for improvement. In this research, we developed a convolutional neural network framework, dubbed iKcr_CNN, to identify human nonhistone Kcr modifications. To overcome the imbalance issue (Kcr: 15,274; non-Kcr: 74,018; imbalance ratio of about 1:5), we applied the focal loss function instead of the standard cross-entropy as the objective used to optimize the model, which not only assigns different weights to samples of different categories but also distinguishes easy- from hard-classified samples. Ultimately, the obtained model presents more balanced prediction scores between real-world positive and negative samples than existing tools. The user-friendly web server is accessible at ikcrcnn.webmalab.cn/, and the corresponding Python scripts can be conveniently downloaded at github.com/lijundou/iKcr_CNN/. The proposed model may serve as an efficient tool to assist academicians with their experimental research.
© 2022 The Authors.


Keywords:  Convolutional neural networks; Focal loss function; Imbalance classification; Nonhistone crotonylation modification

Year:  2022        PMID: 35832615      PMCID: PMC9251780          DOI: 10.1016/j.csbj.2022.06.032

Source DB:  PubMed          Journal:  Comput Struct Biotechnol J        ISSN: 2001-0370            Impact factor:   6.155


Introduction

As one of the important post-translational modifications (PTMs), lysine crotonylation (Kcr) was first discovered by Tan et al. in 2011 [1]. It usually occurs on histone proteins in transcriptionally active chromatin regions, thereby playing significant roles in reproductive regulation, gene expression, and chromatin structure [2], [3], [4]. In 2017, Xu et al. first confirmed the occurrence of Kcr modification on nonhistone proteins [5] and proved that it is widely localized in the cytoplasm and nucleus of H1299 and HeLa cells, as well as in a variety of mouse tissues. Subsequent studies elucidated its crucial roles in various physiological and pathological processes, indicating its significance for academic research and medical applications [6], [7]. Predicting Kcr positions is the first step towards mechanistic investigation. Experimentally, multiple high-throughput techniques have been developed, such as high-performance liquid chromatography (HPLC) fractionation and HPLC-tandem mass spectrometry (MS) [6], [7], [8]. However, these methods are time-consuming, expensive and labour-intensive, making large-scale analysis difficult. Therefore, developing bioinformatics tools based on mathematical and statistical theories has become a promising alternative. In particular, machine-learning (ML)-based predictors exhibit considerable advantages in terms of time and budget, and have presented satisfactory prediction results. To date, a total of eleven predictors have been published for the identification of protein Kcr sites. As summarized in Table 1, the first tool, called CrotPred, applied the discrete hidden Markov model (DHMM) to explore histone Kcr sites in 2016 [9]. Later, five traditional ML-based models were developed, including the support vector machine (SVM) model by Qiu et al.
[10], CKSAAP_CrotSite (SVM) [11], iKcr-PseEns (ensemble random forest, RF) [12], LightGBM-CroSite (LightGBM) [13], and the random forest (RF)/SVM classifiers by Wang et al. [14]. The remaining five were all built on deep learning (DL) frameworks, including iCrotoK-PseAAC [15], pKcr [16], Deep-Kcr [17], DeepKcrot [18] and nhKcr [19]. Looking across these predictors, we can observe the following. In terms of datasets, the early tools mainly concentrated on histone or mixed data, whereas recent works began to study nonhistone proteins owing to the enrichment of high-throughput nonhistone data; accordingly, the number of samples sharply increased from 34/90 (the number of positive samples over negative samples) in the first model, CrotPred [9], to 15,605/75,111 in the latest model, nhKcr [19], which effectively guarantees the statistical significance of the constructed models. In terms of protein features, the predictors roughly cover several classical, easily interpreted descriptors (i.e., composition of k-spaced amino acid pairs (CKSAAP), one-hot, enhanced amino acid composition (EAAC), pseudo-amino acid composition (PseAAC), pseudo-position specific scoring matrix (PsePSSM), etc.) and deep representation learning embedding methods (i.e., word embedding (WE)). In terms of algorithms, researchers have been more inclined to choose deep learning techniques over conventional classifiers. As for model evaluation, recent works strictly performed cross validation and independent tests to obtain objective and reliable results.
Table 1

Description of eleven computational tools for protein Kcr prediction.

| Tools | Species | Train | Test | Features (selection) (imbalance); classifier | Test mode | Sn (%) | Sp (%) | MCC | AUC | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| nhKcr | nonhistone-human | 12,262/60,101 | 3,343/15,010 | BE, AAIndex, BLOSUM62; CNN | 5-cv | 62.86 | 90.00 | 0.51 | 0.88 | 2021 |
| | | | | | Inde | 58.90 | 90.00 | 0.48 | 0.88 | |
| DeepKcrot | nonhistone-human | 6,687/67,106 | 1,483/16,497 | WE; CNN | 5-cv | 53.70 | 90.00 | 0.34 | 0.86 | 2021 |
| | | | | | Inde | 52.40 | 90.00 | 0.34 | 0.86 | |
| | nonhistone-papaya | 2,742/29,676 | 711/7,458 | | Inde | | | | 0.88 | |
| | nonhistone-rice | 909/11,780 | 225/2,734 | | Inde | | | | 0.86 | |
| | nonhistone-tabacum | 1,643/9,696 | 451/2,449 | | Inde | | | | 0.84 | |
| Wang's work | histone-mammalian | 167/167 | 44/95 | AAC, AAPC, BE, CKSAAP, EAAC, EGAAC (CHI2) (RUS); RF/SVM | Inde | 92.00 | 88.00 | 0.80 | | 2020 |
| | nonhistone-plant | 2,548/2,548 | 669/6,720 | | Inde | 83.00 | 70.00 | 0.54 | 0.84 | |
| LightGBM-CroSite | histone-human,mouse | 159/847 | | BE, PWAA, EBGW, KNN, PsePSSM (Elastic) (SMOTE); LightGBM | Jack | 98.86 | 99.11 | 0.98 | 1.00 | 2020 |
| Deep-Kcr | mixed-human | all: 9,964/9,964 (train/test = 7:3) | | CKSAAP, PWAA, AAIndex, CTD, EBGW; CNN | 10-cv | | | | 0.86 | 2020 |
| pKcr | nonhistone-papaya | 2,742/29,676 | 711/7,458 | WE; CNN | 10-cv | 51.69 | 90.00 | 0.34 | 0.86 | 2019 |
| | | | | | Inde | 53.67 | 90.00 | 0.34 | 0.85 | |
| iCrotoK-PseAAC | mixed-mixed | 378/500 | | SVV, SM, PRIM, RPRIM, FV, AAPIV, RAAPIV; ANN | Jack | 99.17 | 99.40 | 98.00 | | 2019 |
| iKcr-PseEns | histone-human,mouse | 169/866 | | PseAAC; ensemble RF | Jack | 90.53 | 95.27 | 81.26 | 0.98 | 2018 |
| CKSAAP_CrotSite | histone-human,mouse | 169/847 | | CKSAAP, PseAAC; SVM | Jack | 92.45 | 99.17 | 92.83 | | 2017 |
| Qiu's work | histone protein | 159/847 | | PWAA; SVM | Jack | 71.69 | 98.70 | 77.80 | | 2017 |
| CrotPred | histone-mixed | 34/90 | | DHMM | Jack | 79.41 | 77.78 | 52.59 | | 2016 |

(For some of the earlier tools, MCC is reported as a percentage, as in the original table.)

Abbreviations: EBGW, encoding based on grouped weight; kNN, k nearest neighbors; PsePSSM, pseudo-position specific scoring matrix; PWAA, position weight amino acid composition; CTD, composition, transition and distribution; PRIM, position relative incidence matrix; RPRIM, reverse position relative incidence matrix; SVV, site vicinity vector; FV, frequency vector; AAPIV, accumulative absolute position incidence vector; RAAPIV, reverse accumulative absolute position incidence vector; CNN, convolutional neural network; RF, random forest; SVM, support vector machine; DHMM, discrete hidden Markov model.

In machine learning, another challenge is data imbalance, i.e., an uneven distribution of samples across classes. The degree of imbalance can be expressed by the imbalance ratio (IR), IR = N_minority / N_majority, where N_majority and N_minority indicate the numbers of samples in the majority and minority classes, respectively. Data imbalance (columns 3 and 4 in Table 1) is a common problem in machine learning and usually biases predictions towards the majority class. For instance, the IR of the nhKcr training dataset is 12,262/60,101 ≈ 1/5. As shown in the column "Features (selection) (imbalance); classifier", two balancing strategies, random under-sampling (RUS) [20] and the synthetic minority oversampling technique (SMOTE) [21], were implemented to balance the datasets before modelling. Considering the limited amount of data and the lack of independent tests for the early models, here we only discuss the two newest tools focused on nonhistone Kcr modification. For the DeepKcrot model, the prediction specificity (Sp) for negative samples over 5-fold cross validation (5-CV) and independent tests both reached 90.00%, whereas the associated sensitivities (Sn) for positive samples were only 53.70% and 52.40%, respectively.
Similarly, nhKcr presented unsatisfactory Sn values of 62.86% and 58.90% over 5-CV and independent tests, respectively. Generally, compared with the average prediction efficiency (∼60%) on positive segments, these two tools are clearly skewed towards negatives (∼90%). In predictions made with the above models, the true Kcr sites that experts actually care about most are thus often neglected. Therefore, it is urgent to develop new Kcr predictors with better and more balanced detection performance. In this research, using large-scale nonhistone Kcr data, we developed a novel computational tool based on convolutional neural networks (CNNs), named iKcr_CNN, to determine modified Kcr positions. As illustrated in Fig. 1, the modelling process includes five main parts: (1) nonhistone Kcr data collection; (2) protein sequence encoding; (3) a one-dimensional CNN architecture with focal loss (FL); (4) performance evaluation; and (5) web server establishment. Meanwhile, we implemented the powerful dimension-reduction method of uniform manifold approximation and projection (UMAP [22]) to visually analyze the data distribution. We demonstrate that the proposed predictor is a reliable tool (available at ikcrcnn.webmalab.cn/) to recognize potential Kcr sites in nonhistone proteins.
Fig. 1

Flowchart of the iKcr_CNN predictor, including data collection, sequence encoding, CNNs-based model development, performance evaluation, web server establishment as well as UMAP visualization of data distribution.


Materials and methods

Benchmark datasets

In this research, we used the benchmark datasets from the related literature and online datasets by Chen et al. [19], which include a large number of non-redundant, experimentally verified Kcr sites on human nonhistone sequences. In short, the data were collected as follows: first, 19,287 Kcr sites in 4,230 proteins were retrieved from the UniProt database [23]; then, a threshold of 0.3 was set in the CD-HIT program [24] to remove redundant segments, which effectively avoids the overfitting caused by sample similarity; next, centred on the residue K, these protein sequences were split into 15,603 Kcr and 164,709 non-Kcr segments with a fixed window length of 29, i.e., 14 residues upstream and 14 residues downstream; finally, 12,262 positive and 60,101 negative samples were collected (marked as 12,262/60,101) for the training dataset and 3,343/15,010 for the testing dataset. Besides, samples containing the auto-filled residue "O" or the rare amino acid "U" were deleted to avoid potential interference or feature-extraction errors. Ultimately, the training and testing datasets contained 12,022/59,226 and 3,252/14,792 samples, respectively. The corresponding IR between Kcr and non-Kcr samples is approximately 1:5. More details on the data preparation process can be found in Ref. [19].
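The windowing step described above can be sketched in a few lines. This is an illustration of the general procedure, not the authors' script; in particular, padding terminal positions with "X" is an assumption for sites lying closer than 14 residues to a sequence end.

```python
# Minimal sketch of 29-residue window extraction centred on lysine (K).
# Terminal positions are padded with 'X' (an assumption for illustration).
def extract_windows(sequence, half=14):
    """Return (1-based position, 29-mer window) pairs for every K in `sequence`."""
    padded = "X" * half + sequence + "X" * half
    windows = []
    for i, aa in enumerate(sequence):
        if aa == "K":
            # window spans `half` residues upstream and downstream of K;
            # padded[i + half] corresponds to sequence[i], the central K
            windows.append((i + 1, padded[i:i + 2 * half + 1]))
    return windows

demo = extract_windows("MKAVLKT")   # two K sites -> two 29-mers
```

Each returned window has the modified/unmodified K exactly at index 14, matching the "14 acids upstream, 14 downstream" layout of the benchmark.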

Protein descriptors

Translating protein fragments into computer-recognizable numeric vectors is a fundamental step in constructing a superior model. In the current study, we explored the effectiveness of eleven common descriptors, which can be grouped into four categories: (1) sequential information: one-hot, EAAC and CKSAAP; (2) evolutionary information: bi-profile Bayes (BPB), BLOSUM62 and position-specific scoring matrix (PSSM); (3) physicochemical information: AAIndex and 188D; and (4) deep representation learning embedding information: bi-directional long short-term memory (BiLSTM) [25], word2vec (W2V) [26], and bidirectional encoder representations from transformers (BERT) [27]. Given a protein sample R = R1R2…Rl (here l = 29), the descriptors can be briefly described as follows.

Sequential information. In the one-hot method, the 20 residues 'ARNDCQEGHILKMFPSTWYV' are separately encoded as 10000000000000000000, 01000000000000000000, …, 00000000000000000001, generating a 20*l-dimensional sparse matrix. EAAC calculates the frequencies of individual residues within sliding k-length segments (default k = 5), producing (l-k+1)*20-D features. Similarly, CKSAAP computes the occurrence probabilities of k-spaced amino acid pairs. When k = 0, it counts the 20*20 0-spaced amino acid pairs (i.e., AA, AC, AD, …); when k = 1, it counts the 400 1-spaced pairs (i.e., A-A, A-C, A-D, …), etc. Finally, it generates a (kmax+1)*400-D feature vector.

Evolutionary information. BPB calculates the position-specific probabilities of the 20 amino acids in the positive and negative subsets and applies them directly to define 2*l-D features. BLOSUM62 uses the protein evolutionary matrix BLOSUM62, as employed by the basic local alignment search tool (BLAST), to encode protein sequences with dimension 20*l. Furthermore, PSSM strictly aligns the sequence against the large NR protein database (ftp://) to obtain a 20*l-D scoring matrix locally.

Physicochemical information. AAIndex incorporates 531 physicochemical properties of amino acids to obtain l*531-D peptide features via the iLearn toolkit [28]. The 188D features comprise two parts. The first calculates the amino acid composition (AAC), forming a 20-D feature vector. The second integrates the composition, transition and distribution (CTD) values of eight important physicochemical attributes, forming a 168-D feature vector [29], [30]. More specifically, these physicochemical properties include hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility. Together, they yield a 188-D protein feature vector per sequence.

Deep representation learning embedding methods. By treating biological sequences as text, bioinformatics researchers have successfully applied multiple advanced natural language processing (NLP) techniques to proteomics and genomics with outstanding results [31], [32]. Here, we exploited three deep learning embedding methods: BiLSTM [25], BERT [27], and W2V [26]. BiLSTM combines forward and backward LSTMs based on bidirectional propagation to extract contextual information [25], reflecting the global structural similarity between proteins and the pairwise residue contact map of each segment [33]. BERT uses the transformer attention mechanism to attend to multiple positions of the same sentence simultaneously [27]. W2V characterizes residues by the similarity of their surrounding context [26]. All of these features can be conveniently acquired through state-of-the-art toolkits such as iLearn, eFeature, and BioSeq-Analysis [34], [35], [36].
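To make the sequential descriptors concrete, here is a minimal sketch of one-hot and EAAC encoding. This is my own illustration following the definitions above, not the iLearn/eFeature implementations used in the paper.

```python
# One-hot and EAAC encoders for a peptide (illustrative re-implementation
# of the descriptor definitions; not the paper's toolkit code).
AA = "ARNDCQEGHILKMFPSTWYV"   # the 20 standard residues, in the order above

def one_hot(seq):
    """20*l binary matrix: row i marks the identity of residue i."""
    return [[1 if aa == a else 0 for a in AA] for aa in seq]

def eaac(seq, k=5):
    """(l-k+1)*20 matrix of residue frequencies in sliding k-windows."""
    rows = []
    for i in range(len(seq) - k + 1):
        win = seq[i:i + k]
        rows.append([win.count(a) / k for a in AA])   # frequency of each residue
    return rows

oh = one_hot("MKAVL")          # 5 x 20 sparse matrix
ea = eaac("MKAVLKTWQ", k=5)    # (9-5+1) x 20 = 5 x 20 frequency matrix
```

Each one-hot row sums to 1 (one active position), and each EAAC row sums to 1 (window frequencies), which makes the dimensions quoted in the text easy to verify.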

Classification algorithms

In recent years, DL technology has made remarkable achievements in image recognition, autonomous driving, NLP, etc. It also demonstrates excellent performance on bioinformatics problems, such as the prediction of protein structure, protein-protein interactions (PPIs), drug design, and disease treatment [37], [38]. Regarding PTM detection, dozens of state-of-the-art tools applying CNNs, transfer learning (TL), LSTMs and attention-based networks have been proposed, such as MusiteDeep [39], CapsNet_PTM [40], and MultiRM [41]. As illustrated in Fig. 2, our CNN architecture consists of an input layer, three 1D convolutional layers, two fully connected layers, and an output layer. As detailed in Table 2, the 812-D AAIndex features served as input and were fed into the network; then, three sequentially connected blocks (each comprising a convolutional layer with a rectified linear unit (ReLU) activation, max pooling, and dropout) extracted hidden discriminative patterns. Specifically, the numbers of filters were set to 200, 100 and 50 for the three convolutional layers, with kernel sizes of 7, 5 and 3, respectively, and a pooling size of 2 throughout. Next, the output of Dropout_3, of shape (98, 50), was flattened into a 4,900-D vector and fed into two dense layers to finally generate a 1-D output, which was activated by the sigmoid function to yield the prediction probability. Here, the dropout rates were set to 0.25 for the first three dropout layers and 0.5 for the last, and the batch size to 150 to avoid out-of-memory errors. During fitting, we chose the popular Adam optimizer with its default learning rate of 0.001 to optimize the 1,097,151 parameters in total. The model is implemented in TensorFlow (v1.12.0) and Keras (v2.2.4) [42]. For comparison, we also investigated several traditional machine learning methods in the Scikit-Learn package (v0.24.2) [43], including RF [44], SVM [45] and naïve Bayes (NB) [46].
Fig. 2

CNNs architecture of this work. AAindex features are continuously fed into three 1D convolutional blocks (convolution, max pooling and dropout layers) to extract informative attributes. Then, one flatten and two dense layers followed by a sigmoid activation function are incorporated to output the prediction results.

Table 2

Parameters involved in this CNNs model.

| Layers | Settings | Output shape | Parameters |
|---|---|---|---|
| Input | selected AAIndex | (-, 812) | |
| Conv1D_1 | filters = 200, kernel_size = 7, activation = 'relu' | (-, 806, 200) | 1,600 |
| Maxpooling_1 | pool_size = 2 | (-, 403, 200) | |
| Dropout_1 | rate = 0.25 | (-, 403, 200) | |
| Conv1D_2 | filters = 100, kernel_size = 5, activation = 'relu' | (-, 399, 100) | 100,100 |
| Maxpooling_2 | pool_size = 2 | (-, 199, 100) | |
| Dropout_2 | rate = 0.25 | (-, 199, 100) | |
| Conv1D_3 | filters = 50, kernel_size = 3, activation = 'relu' | (-, 197, 50) | 15,050 |
| Maxpooling_3 | pool_size = 2 | (-, 98, 50) | |
| Dropout_3 | rate = 0.25 | (-, 98, 50) | |
| Flatten | | (-, 4900) | |
| Dense_1 | units = 200, activation = 'relu' | (-, 200) | 980,200 |
| Dropout_4 | rate = 0.5 | (-, 200) | |
| Dense_2 | units = 1, activation = 'sigmoid' | (-, 1) | 201 |
| Adam | lr = 0.001 | | |
| Total | | | 1,097,151 |
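As a sanity check on Table 2, the layer output lengths and parameter counts can be reproduced with standard Conv1D/Dense bookkeeping. This is a sketch of the arithmetic, not the authors' TensorFlow script:

```python
# Reproduce the shapes and parameter counts of Table 2 using standard
# 'valid' Conv1D and Dense parameter formulas (illustrative arithmetic only).
def conv1d(length, in_ch, filters, kernel):
    """Return (new length, weight count incl. biases) for a 'valid' Conv1D."""
    return length - kernel + 1, filters * (kernel * in_ch + 1)

length, channels, total = 812, 1, 0
for filters, kernel in [(200, 7), (100, 5), (50, 3)]:
    length, params = conv1d(length, channels, filters, kernel)
    total += params
    length //= 2          # max pooling with pool_size = 2
    channels = filters    # dropout leaves shape and weights unchanged

flat = length * channels              # flatten: 98 * 50 = 4,900
total += flat * 200 + 200             # Dense_1 (200 ReLU units)
total += 200 * 1 + 1                  # Dense_2 (1 sigmoid unit)
```

Running this yields the lengths 806/403, 399/199 and 197/98 of Table 2 and the total of 1,097,151 trainable parameters quoted in the text.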

Imbalance strategy

Data imbalance exists in almost all machine learning fields, such as credit fraud, information security, image processing, and bioinformatics [47]. Traditional algorithms optimize by maximizing the overall prediction accuracy. As a result, classification models are usually severely skewed towards the majority class, which is typically the opposite of the class of interest to experts, seriously limiting the reliability and applicability of the proposed tools. Fortunately, various imbalance strategies have been proposed, such as SMOTE [21], RUS [20], cost-sensitive learning, and ensemble classifiers, which can be roughly grouped into data-, algorithm-, and hybrid-level methods [48], [49]. Taking the data-level SMOTE as an example [21], it synthesizes new samples by interpolating between a minority sample x_i and one of its k-nearest minority-class neighbours x̂_i, randomly chosen, formulated as x_new = x_i + rand(0, 1) × (x̂_i − x_i). Here, rand(0, 1) produces a random number between 0 and 1. Ultimately, a large number of new minority-class samples can be generated to form a balanced dataset for model building. In the frame of DL, the imbalance issue can similarly be addressed by adjusting class weights, evaluation metrics, and loss functions. For instance, the online hard example mining (OHEM) method increases the weight of mis-predicted samples to obtain better results; however, it neglects the influence of easy-classified samples. As an improvement, Lin et al. proposed focal loss (FL) in 2017 while comparing one- and two-stage object detection techniques, with excellent results [50], mainly benefiting from the initial precise identification of valid target areas, which is in fact an imbalance problem. In the past five years, FL has become a powerful way to address the imbalance issue [50], [51], [52], [53]: it not only assigns asymmetrical weights to samples in different classes but also sets different weights on easy- and hard-classified samples.
Specifically, the standard cross entropy (CE) in two-class classification is expressed as CE(p, y) = −log(p) if y = 1 and −log(1 − p) otherwise. Here, y indicates the true label of the considered sample (1: positive; 0: negative), and p is the predicted probability, in the range 0–1, that the query sample belongs to class 1. Conveniently, we can write p_t = p if y = 1 and p_t = 1 − p otherwise; then CE can be shortened as CE(p_t) = −log(p_t). FL improves the loss function in two ways. First, it distinguishes the contributions of samples of different categories by introducing the weight parameter α for class 1 and 1 − α for class 0. Defining α_t analogously to p_t, the α-balanced CE loss can be written as CE(p_t) = −α_t log(p_t). Second, it distinguishes the contributions of easy- and hard-classified samples by introducing the focusing parameter γ (γ ≥ 0), so that CE is rewritten as FL(p_t) = −(1 − p_t)^γ log(p_t). Ultimately, FL can be written as FL(p_t) = −α_t (1 − p_t)^γ log(p_t). In summary, α_t exposes different weights for samples belonging to the positive and negative classes, and (1 − p_t)^γ, called the modulating factor, decreases the contribution of easy-classified samples. As a result, FL makes the classifier focus mostly on the minority class and hard-classified samples, as we expected. The best values of α and γ can be optimized on the integrated deep learning platform via grid search.
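The FL formula above can be implemented directly. The paper realizes FL as a Keras loss inside the CNN; the following is an illustrative pure-Python re-implementation of the same expression:

```python
# Binary focal loss: mean of FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
# Illustrative re-implementation of the formula above, not the Keras loss
# used in the paper; alpha = 0.25, gamma = 2.0 are common defaults from
# Lin et al., not the paper's tuned values.
import math

def focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-7), 1 - 1e-7)        # numerical safety clip
        p_t = p if y == 1 else 1 - p           # probability of the true class
        a_t = alpha if y == 1 else 1 - alpha   # class-balancing weight
        total += -a_t * (1 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

easy = focal_loss([0.95], [1])   # confidently correct -> strongly down-weighted
hard = focal_loss([0.30], [1])   # badly wrong -> dominates the loss
```

With γ = 0 and α = 0.5, FL reduces to a uniformly half-weighted CE, which is a convenient correctness check; raising γ shrinks the easy-sample term much faster than the hard-sample term.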

Performance evaluation

Here, 5-fold CV and independent tests were both conducted to evaluate model performance. In 5-fold CV, the training dataset is randomly divided into five sub-datasets; four are used to train the model and the remaining one to test it, and the procedure is repeated until every sub-dataset has been used for testing once. In addition, an independent test is performed to check model generalizability. Six criteria are used to quantitatively measure model efficiency: sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthew's correlation coefficient (MCC), precision (Pre) and F1 score, formulated as follows:

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Pre = TP / (TP + FP)
F1 = 2 × Pre × Sn / (Pre + Sn)

Here, TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives. As supplements, the receiver operating characteristic curve (ROC; false-positive rate (FPR) vs. true-positive rate (TPR)), the precision-recall (PR) curve, and the relevant areas under these curves (AUROC, AUPRC) are also reported as objective indices, because they are independent of the classification threshold.
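The six criteria can be computed directly from the confusion-matrix counts. This sketch uses the standard definitions and is not tied to the paper's evaluation script:

```python
# Compute Sn, Sp, Acc, MCC, Pre and F1 from confusion-matrix counts
# (standard definitions; illustrative, not the paper's evaluation code).
import math

def metrics(tp, tn, fp, fn):
    sn  = tp / (tp + fn)                       # sensitivity (recall)
    sp  = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    pre = tp / (tp + fp)                       # precision
    f1  = 2 * pre * sn / (pre + sn)
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc, "Pre": pre, "F1": f1}

m = metrics(tp=80, tn=75, fp=25, fn=20)   # hypothetical counts for illustration
```

On imbalanced data, Acc alone is misleading (always predicting the majority class already yields ∼83% Acc at a 1:5 ratio), which is why the balanced pair Sn/Sp and the MCC are emphasized throughout the paper.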

Results and discussion

In the current study, we applied the selected AAIndex features (812-D) to represent proteins and developed a novel convolutional neural network framework to find potential human nonhistone Kcr sites. To eliminate the prediction bias towards negative samples, we applied the focal loss function instead of the standard cross-entropy to guide the optimization process. This concise model ultimately achieved a prediction score of 77.31% for true Kcr sites and 78.62% for false Kcr sites with an AUC of 0.86 over 5-fold CV, as well as 77.87% for true Kcr sites and 76.61% for false Kcr sites with an AUC of 0.85 over the independent test, presenting more balanced results on real-world datasets than existing tools. The user-friendly web server is accessible at ikcrcnn.webmalab.cn/.

Analysis of residues distribution

First, we analyzed the statistical distribution of position-specific residues between the positive and negative subsets (t-test: p < 0.05) in Fig. 3 using the Two Sample Logo platform [54]. Residues are colored by charge: blue, red, and black denote positively charged, negatively charged, and neutral residues, respectively. Overall, apparent differences can be observed between the Kcr fragments (upper panel) and non-Kcr fragments (lower panel): charged residues (blue and red) are significantly enriched in the modified samples, whereas neutral residues (black) dominate the nonmodified samples. Considering position preference, for the residues close to the central K (positions −4 to 4, circled by a green box), negatively charged residues (E, D) and neutral residues (A, N, V, Y, etc.) are more likely to occur in the positive set, whereas positively charged residues (R, K) as well as several neutral residues (P, S, F, etc.) favour the negative set. In particular, six residues close to the centre show preference values higher than 4%: E (6.9%) in the positive set and K (8.1%) and R (5.8%) in the negative set at upstream position −1, as well as E (6.3%) and D (4.0%) in the positive set and P (4.4%) in the negative set at downstream position 1. This noticeable position-specific distinction is fundamental to building reliable computational Kcr tools.
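The quantity underlying a two-sample logo is simply the per-position residue-frequency difference between the two sets. A toy sketch (illustration only; the paper uses the Two Sample Logo web platform, which additionally applies the t-test filter):

```python
# Per-position residue-frequency comparison between two segment sets
# (toy illustration of the statistic behind a two-sample logo).
def position_freq(segments, pos, residue):
    """Fraction of segments carrying `residue` at column `pos`."""
    return sum(seg[pos] == residue for seg in segments) / len(segments)

pos_set = ["AEKEA", "AEKDA", "GEKEV"]   # hypothetical 5-mer "Kcr" segments
neg_set = ["AKKPA", "ARKPV", "GKKPS"]   # hypothetical "non-Kcr" segments

# enrichment of E at the position just upstream of the central K (index 1)
diff_E = position_freq(pos_set, 1, "E") - position_freq(neg_set, 1, "E")
```

A positive difference marks enrichment in the modified set (plotted above the axis in Fig. 3), a negative difference marks depletion (plotted below).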
Fig. 3

Two Sample Logo of human nonhistone Kcr segments (p < 0.05), where the positively charged, negatively charged and neutral residues are separately indicated in blue, red, and black [54]. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Preliminary results of multiple protein descriptors and imbalance/classification algorithms

As depicted in Fig. 1, modelling is a complex process whose performance depends on many factors (features, algorithms, hyperparameters, etc.). Thus, it is hard to obtain the absolutely best-performing model, and we can only screen out the locally best one within the limited scope of the current study. Here, we conducted a series of preliminary experiments to assess the performance of several commonly used feature extraction methods (including one-hot, EAAC, CKSAAP, PSSM, AAIndex, etc.). Notably, for each feature representation, we applied multiple imbalance strategies/classifiers (including SMOTE, RF, SVM, CNN (CE), CNN (FL), etc.). After a systematic comparison, we selected the best-performing combination, the protein feature AAIndex_nhKcr with the CNN (FL) classifier, as the candidate model for further optimization and analysis. Among these preliminary experiments, we summarize parts of the sequence representation results based on CNN (FL) in Table 3 and the results of different classifiers based on the AAIndex_nhKcr features in Table 4. Since a strict standard 5-CV experiment with such a large number of training samples is time-consuming, we split the raw training dataset into two sub-datasets with a ratio of 8:2, marked as data_training (9,599/47,399) and data_validation (2,423/11,827), to carry out the validation experiments. The associated results of the validation and independent tests were then obtained, where the second column "Num_Feas" indicates the number of features involved.
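An 8:2 split like the one above is usually done per class, so that both parts keep the original ∼1:5 imbalance. A minimal sketch (the exact shuffling and seed used in the paper are not stated, so this is an assumption-laden illustration):

```python
# Stratified 8:2 split: shuffle and cut each class separately so both
# parts preserve the class ratio (illustrative; not the paper's script).
import random

def stratified_split(samples, labels, ratio=0.8, seed=42):
    rng = random.Random(seed)
    train, val = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(len(idx) * ratio)
        train += [(samples[i], cls) for i in idx[:cut]]
        val += [(samples[i], cls) for i in idx[cut:]]
    return train, val

X = list(range(100))
y = [1] * 20 + [0] * 80            # toy 1:4 imbalance
tr, va = stratified_split(X, y)    # both halves keep the 1:4 ratio
```

Applied to the 12,022/59,226 training set, this kind of split yields the 9,599/47,399 and 2,423/11,827 sub-datasets reported above.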
Table 3

Performance of different protein representation approaches based on the CNNs(FL) algorithm.

| Features | Num_Feas | Val Sn (%) | Val Sp (%) | Val Acc (%) | Val MCC | Val AUC | Ind Sn (%) | Ind Sp (%) | Ind Acc (%) | Ind MCC | Ind AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| One-hot | 580 | 56.20 | 79.82 | 75.83 | 0.31 | 0.76 | 62.10 | 75.80 | 73.33 | 0.31 | 0.77 |
| EAAC | 500 | 74.24 | 74.74 | 74.66 | 0.39 | 0.82 | 76.73 | 75.83 | 75.99 | 0.43 | 0.84 |
| CKSAAP (kmax = 0) | 400 | 69.23 | 62.02 | 63.24 | 0.24 | 0.71 | 67.23 | 62.51 | 63.36 | 0.23 | 0.70 |
| CKSAAP (kmax = 1) | 800 | 64.79 | 66.07 | 65.85 | 0.24 | 0.71 | 60.65 | 70.60 | 68.81 | 0.25 | 0.72 |
| CKSAAP (kmax = 2) | 1200 | 56.94 | 73.36 | 70.59 | 0.24 | 0.72 | 61.45 | 68.77 | 67.45 | 0.24 | 0.71 |
| CKSAAP (kmax = 3) | 1600 | 65.32 | 67.12 | 66.82 | 0.25 | 0.72 | 62.04 | 70.63 | 69.08 | 0.26 | 0.72 |
| BPB | 58 | 59.16 | 72.56 | 70.30 | 0.25 | 0.73 | 66.92 | 67.31 | 67.24 | 0.27 | 0.74 |
| BLOSUM62 | 580 | 81.40 | 71.89 | 73.50 | 0.41 | 0.84 | 77.47 | 74.78 | 75.27 | 0.42 | 0.83 |
| PSSM | 580 | 75.08 | 74.20 | 74.34 | 0.39 | 0.82 | 70.95 | 74.42 | 73.79 | 0.37 | 0.81 |
| AAIndex | 15,399 | 71.64 | 73.92 | 73.53 | 0.36 | 0.81 | 78.79 | 72.35 | 73.51 | 0.41 | 0.83 |
| AAIndex_nhKcr | 812 | 75.24 | 78.73 | 78.14 | 0.44 | 0.85 | 81.43 | 73.36 | 74.80 | 0.44 | 0.84 |
| 188D | 188 | 76.82 | 60.19 | 63.00 | 0.28 | 0.75 | 72.52 | 67.72 | 68.58 | 0.32 | 0.77 |
| BiLSTM | 3605 | 55.95 | 69.62 | 67.31 | 0.20 | 0.69 | 56.93 | 67.05 | 65.23 | 0.19 | 0.68 |
| TAPE_BERT | 768 | 74.49 | 56.62 | 59.64 | 0.23 | 0.71 | 68.89 | 63.23 | 64.25 | 0.25 | 0.72 |
| W2V | 300 | 61.52 | 62.23 | 62.11 | 0.18 | 0.67 | 59.76 | 63.95 | 63.20 | 0.19 | 0.67 |
| nhKcr | 2523 | 75.60 | 78.40 | 77.93 | 0.44 | 0.85 | 74.30 | 76.53 | 76.13 | 0.42 | 0.83 |
| nhKcr + EAAC + PSSM | 3603 | 80.57 | 74.16 | 75.24 | 0.43 | 0.85 | 82.69 | 71.91 | 73.85 | 0.43 | 0.85 |

Num_Feas: the number of features.

AAIndex_nhKcr: the AAindex-related 841-D features by considering top 29 most important physicochemical properties in the predictor nhKcr [19].

nhKcr: the combined protein features proposed in the predictor nhKcr [19].

Table 4

Comparison of different imbalance strategies/classifiers based on the selected AAIndex_nhKcr features.

| Classifiers | Val Sn (%) | Val Sp (%) | Val Acc (%) | Val MCC | Val AUC | Ind Sn (%) | Ind Sp (%) | Ind Acc (%) | Ind MCC | Ind AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| RF | 81.38 | 99.68 | 90.53 | 0.82 | 0.97 | 4.24 | 99.44 | 82.28 | 0.13 | 0.77 |
| SVM | 44.58 | 73.64 | 68.73 | 0.15 | 0.64 | 38.86 | 75.68 | 69.04 | 0.13 | 0.63 |
| NB | 61.99 | 66.96 | 66.12 | 0.22 | 0.71 | 62.22 | 67.13 | 66.25 | 0.23 | 0.71 |
| CNN (CE) | 22.79 | 97.02 | 84.49 | 0.30 | 0.85 | 36.80 | 93.97 | 83.66 | 0.37 | 0.85 |
| SMOTE + RF | 81.53 | 99.59 | 90.56 | 0.82 | 0.97 | 4.15 | 99.47 | 82.29 | 0.13 | 0.77 |
| SMOTE + SVM | 85.71 | 70.58 | 78.14 | 0.57 | 0.89 | 50.11 | 67.63 | 64.47 | 0.14 | 0.63 |
| SMOTE + NB | 78.31 | 73.41 | 75.86 | 0.52 | 0.84 | 43.77 | 73.51 | 68.15 | 0.15 | 0.65 |
| SMOTE + CNN (CE) | 82.58 | 97.10 | 89.84 | 0.81 | 0.97 | 33.75 | 93.99 | 83.13 | 0.34 | 0.84 |
| CNN (αCE) | 77.93 | 77.76 | 77.79 | 0.45 | 0.85 | 80.51 | 74.30 | 75.42 | 0.44 | 0.85 |
| CNN (FL) | 75.24 | 78.73 | 78.14 | 0.44 | 0.85 | 81.43 | 73.36 | 74.80 | 0.44 | 0.84 |
In Table 3, we compare the effectiveness of 17 single/combined feature encodings based on the CNN architecture with the FL function (α and γ set by grid search, batch_size = 150, epochs = 10). Among the first six types of sequence information, the 500-D EAAC vector presented good results (validation: Sn = 74.24%, Sp = 74.74%; independent: Sn = 76.73%, Sp = 75.83%). For the evolutionary information, except for the BPB descriptor, BLOSUM62 and PSSM both gave Sn and Sp prediction scores above 70%. Regarding physicochemical methods, three encoding approaches were considered: AAIndex, AAIndex_nhKcr and 188D. Compared with the results of the full AAIndex features, AAIndex_nhKcr displayed more encouraging results of 75.24% Sn, 78.73% Sp, 78.14% Acc, 0.44 MCC and 0.85 AUC over validation, and 81.43% Sn, 73.36% Sp, 74.80% Acc, 0.44 MCC and 0.84 AUC over the independent test. As supplements, we also investigated three types of embedded features, BiLSTM, TAPE_BERT and W2V, using the eFeature toolkit; however, these features all yielded unsatisfactory scores of 60%∼70%. The combined features of the nhKcr predictor (marked as nhKcr), consisting of one-hot, BLOSUM62 and selected AAIndex, also gave good results [19]. Further incorporation of the effective EAAC and PSSM features, forming a 3,603-D vector, did not display obvious enhancement. In general, the descriptors EAAC, BLOSUM62, PSSM, AAIndex, AAIndex_nhKcr, nhKcr and nhKcr + EAAC + PSSM presented good recognition scores of approximately 75% based on the CNN (FL) algorithm.
As is well known, AAIndex is a classical physicochemistry-based method that considers 531 physicochemical properties in total to induce a 15,399-D vector. Furthermore, Chen et al. [19] applied RF as a feature selection approach to construct an effective feature subset with the 29 top-ranked properties (marked as AAIndex_nhKcr). After deleting the columns with identical values across all samples (those related to the K residue at the center), (29 − 1) × 29 = 812-D features remained to feed into the CNNs model. Typically, feature extraction methods such as PSSM and deep representation learning require large memory as well as high time cost. In addition, the feature combination process is often accompanied by the curse of dimensionality and information redundancy. Balancing model performance against computational cost, we ultimately chose only the AAIndex_nhKcr features to construct a simple yet efficient computational model. To better illustrate the effectiveness of the AAIndex features, we carried out a t-test against another representative descriptor, EAAC. Specifically, we repeated the validation tests of these two models ten times and calculated the p-values. Taking the comprehensive metrics MCC and AUC as examples, the corresponding p-values reached 2.19e-11 and 3.368e-11. Because p was much smaller than 0.05, we can say that the AAIndex_nhKcr encoder is significantly better than the EAAC encoding at nearly 100% confidence. In Table 4, we summarized the results of a series of (imbalance-aware) algorithms based on the AAIndex_nhKcr features, including three conventional ML classifiers (RF, SVM and NB) and the DL algorithm of CNNs. Moreover, we equipped these classifiers with the classical SMOTE and different loss functions (CE, αCE and FL) to explore the effect of data imbalance. Among the data in rows 1–4, we can observe a severe overfitting problem for the RF model, corresponding to the lowest Sn of 4.24%.
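The t-test above compares the score distributions of the two encoders over repeated runs. The following sketch computes a Welch t-statistic with only the standard library; the two score lists are made-up stand-ins for the paper's ten repeated MCC values, and in practice `scipy.stats.ttest_ind` would give both the statistic and the p-value directly:

```python
# Illustrative two-sample t-statistic for comparing ten repeated runs of two
# encoders. The values below are fabricated for illustration only, not the
# paper's actual MCC scores.
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

aaindex_mcc = [0.45, 0.44, 0.46, 0.45, 0.44, 0.45, 0.46, 0.45, 0.44, 0.45]
eaac_mcc    = [0.40, 0.39, 0.41, 0.40, 0.39, 0.40, 0.41, 0.40, 0.39, 0.40]
print(round(welch_t(aaindex_mcc, eaac_mcc), 2))
```

A large positive statistic (and correspondingly tiny p-value) indicates the first encoder's scores are systematically higher.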
For the SVM and CNN (CE) models, the imbalance issue seriously influenced performance, with validation Sn scores of only 44.58% and 22.79%, respectively. As for the NB model, although its prediction precision was relatively balanced between the two classes, it still cannot meet the experimental requirements. The next four rows (5–8) show the results of the classical over-sampling strategy SMOTE implemented with the above four classifiers (marked as SMOTE + RF, SMOTE + SVM, SMOTE + NB and SMOTE + CNN (CE)). Overall, a noticeable overestimation problem can be observed: the average Sn score of 32.85% over the independent test was much lower than the 82.03% over the validation test. At last, we applied the weighted CE (marked as αCE) and the FL function to the CNN architecture to enhance performance. Once different weights were assigned to the involved classes, the recognition skew towards the major class was significantly reduced. Accordingly, the average identification scores of the αCE- and FL-based CNN models sharply increased to 78.78%, a 20%–80% improvement over the previous results. For a clear comparison of imbalance strategies, a radar plot of the independent results of the last six models is depicted in Fig. 4. The four SMOTE-based models all presented poor results for positive samples. In machine learning research, a very important assumption is that the training and testing datasets share the same data distribution characteristics. However, keeping the spatial properties of the data unchanged is challenging when dealing with imbalanced datasets. Although SMOTE is a classical over-sampling technique, it synthesizes new samples by linear interpolation between a minority sample and its k-nearest-neighbor (KNN) samples (more details can be seen in Eq. (2) and Ref. [21]).
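The interpolation step at the heart of SMOTE can be sketched in isolation. This is a simplified illustration of the idea behind Eq. (2): a full SMOTE implementation (e.g., in imbalanced-learn) additionally performs the k-nearest-neighbor search over the minority class.

```python
# A minimal sketch of SMOTE's core interpolation step: a synthetic sample is
# placed on the line segment between a minority sample and one of its
# k nearest minority-class neighbors.
import random

def smote_interpolate(x, neighbor, rng=random):
    """x_new = x + lam * (neighbor - x), with lam drawn uniformly from [0, 1]."""
    lam = rng.random()
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
x = [0.0, 0.0]
nn = [1.0, 2.0]
new = smote_interpolate(x, nn)
# the synthetic point always lies between the two originals
assert all(min(a, b) <= v <= max(a, b) for a, b, v in zip(x, nn, new))
```

Because every synthetic point is a convex combination of two existing minority samples, the method can only densify the region the minority class already occupies, which is exactly why it may fail to preserve the true data distribution.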
In our tests, the model trained on the balanced dataset performed well in the 5-CV experiment but badly on the independent test, especially for true Kcr sites. We believe that the simple linear interpolation in SMOTE cannot precisely preserve the original data properties. The related classifiers were still dedicated to distinguishing negative samples and needed to concentrate more on the critical positive samples. Fortunately, Sn values sharply increased to 80.51% for CNN (αCE, cyan line) and 81.43% for CNN (FL, red line). Compared with αCE, FL additionally distinguishes easy- and hard-classified samples through the focusing parameter γ, which provides more opportunity to enhance model power and understand data characteristics. We conclude that the CNN scheme implemented with class weights is superior to the application of SMOTE for true Kcr detection.
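The down-weighting behaviour of FL can be seen directly from its formula, FL(p_t) = −α_t (1 − p_t)^γ log(p_t) [50]. The sketch below uses illustrative values α = 0.8 and γ = 2 (not the paper's tuned values) to show how a confidently correct sample contributes orders of magnitude less loss than a misclassified one:

```python
# A sketch of the binary focal loss (Lin et al. [50]):
#   FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)
# With gamma > 0, well-classified (easy) samples contribute far less than
# hard ones. alpha = 0.8 and gamma = 2 below are illustrative values.
import math

def focal_loss(p, y, alpha=0.8, gamma=2.0):
    """p: predicted probability of the positive class; y: true label (0/1)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.95, 1)   # confidently correct positive
hard = focal_loss(0.30, 1)   # misclassified positive
assert hard > easy * 100     # easy samples are strongly down-weighted
```

Setting γ = 0 and α = 1 recovers the standard cross-entropy, and setting only γ = 0 gives the weighted αCE variant compared in Table 4.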
Fig. 4

Radar plot of independent prediction results of six different imbalance classifiers using selected AAIndex_nhKcr features.


Model optimization and data visualization

After a series of preliminary investigations, we determined the general framework of the model, which integrates the AAIndex_nhKcr protein features and a CNN classifier with FL. As illustrated in the Imbalance strategies section, the FL function includes two parameters, α and γ. First, α is introduced to assign different class weights to samples of different categories; thus, we can set α according to the imbalance ratio. The focusing parameter γ is used to distinguish the contributions of easy- and hard-classified samples to the loss function. As shown in Ref. [50], the contribution of well-classified examples decreases as γ increases; thus, we gradually increased γ to find the best value. Comprehensively, we applied the grid search method over candidate ranges of α and γ, and after comprehensive comparison we obtained the best model with the selected α and γ values. Based on the optimized parameters, we carried out standard 5-CV and independent experiments to compare fairly with published tools. Meanwhile, we recorded the changes of the FL values and accuracy scores with increasing epoch in Fig. 5A and B, respectively. More specifically, FL at first dropped rapidly within epochs 1–6, then slowly decreased towards a steady value of ∼0.12. A weak increasing trend can be observed over epochs 16–20, indicating a potential overfitting problem. Correspondingly, accuracy showed an encouraging increase and gradually stabilized. Here, four-fifths of the training samples (57,000 samples) were used for training during the fitting process of 5-fold CV, which means that more than 380 iterations were performed in each epoch with the batch size of 150. Comprehensively considering detection capability and computational cost, we finally used the model at epoch 16 as our final model (marked by the pale purple area in Fig. 5).
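The grid search described above can be sketched as a simple loop. Here `evaluate_model` is a hypothetical stand-in for training the CNN (batch_size = 150) with a given (α, γ) pair and returning a validation score, and the candidate grids are illustrative since the paper's exact search ranges are not given:

```python
# Skeleton of the grid search over the focal-loss parameters alpha and gamma.
# evaluate_model is a hypothetical placeholder; a real run would train the
# CNN with FL(alpha, gamma) and return e.g. the mean validation MCC of 5-CV.
from itertools import product

def evaluate_model(alpha, gamma):
    # dummy scoring surface peaking at (0.8, 2.0), for illustration only
    return -(alpha - 0.8) ** 2 - (gamma - 2.0) ** 2

alphas = [0.2, 0.5, 0.8]          # illustrative candidate class weights
gammas = [0.5, 1.0, 2.0, 5.0]     # illustrative candidate focusing values
best = max(product(alphas, gammas), key=lambda ag: evaluate_model(*ag))
print(best)  # (0.8, 2.0)
```

With 57,000 training samples and batch_size = 150, each epoch runs 57,000 / 150 = 380 iterations, matching the count stated above.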
Fig. 5

The changes of FL (A) and Accuracy (B) with increasing epoch (1 ∼ 20) over 5-CV and independent experiments.

Finally, our model was constructed using the whole training dataset with the selected 812-D AAIndex_nhKcr features, based on the CNNs method implemented with the FL loss (the selected α and γ, epoch = 16, batch_size = 150). Fig. 6 plots the evaluation metrics of the optimized model (A. Sn, Sp, Acc, MCC, Pre, F1; B. ROC curves; C. PR curves). More specifically, 5-CV gave Sn = 77.13%, Sp = 78.26%, MCC = 0.45, AUC = 0.85 and AUPRC = 0.52, and the corresponding independent results were Sn = 81.46%, Sp = 74.23%, MCC = 0.45, AUC = 0.87 and AUPRC = 0.52. The 5-CV and independent tests gave comparable and balanced results. Thus, we believe that our model is an efficient tool to detect Kcr sites.
Fig. 6

Evaluated metrics of optimized model over 5-CV and independent tests (A. Sn, Sp, Acc, MCC, Pre, F1; B. ROC curves; C. PR curves).

At last, we visually analyzed the distribution characteristics of the training samples using the popular dimension-reduction method UMAP. Because more than 70,000 samples are involved in the training dataset, we randomly selected 10% of all sequences, keeping the same IR of 1:5, for a clear illustration. Fig. 7A plots the UMAP results of the original 812-D AAIndex protein features, i.e., the input data for the convolution model. It can be observed that the positive (red dots) and negative samples (purple dots) were strongly mixed together and evenly scattered throughout the entire feature space. Furthermore, we extracted the output of the hidden fully connected layer ("Dense1" in Table 2), which forms 200-D hidden attributes. Fig. 7B displays the UMAP results of these CNN-learned hidden features. Remarkably, the positive samples (red dots) were mainly concentrated in the lower-left area, whereas the negative samples gathered in the opposite upper-right area. Compared with the original AAIndex features, the patterns mined by the series of convolution, pooling and dropout operations are more discriminative for distinguishing Kcr and non-Kcr proteins, which reflects the effectiveness of the CNNs framework as a feature extraction approach that mines hidden distinctive attributes from simple input features.
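Sampling 10% of the data while keeping the imbalance ratio amounts to sampling each class separately. A minimal sketch (the class sizes below are toy values, not the actual dataset counts):

```python
# Sketch of the stratified 10% subsample used before UMAP visualization:
# sampling each class separately preserves the imbalance ratio of the full
# training set. Toy index lists stand in for the real feature matrices.
import random

def stratified_subsample(pos, neg, frac=0.1, seed=42):
    rng = random.Random(seed)
    return (rng.sample(pos, int(len(pos) * frac)),
            rng.sample(neg, int(len(neg) * frac)))

pos = list(range(1000))          # toy positive samples
neg = list(range(5000))          # toy negatives, IR = 1:5
sub_pos, sub_neg = stratified_subsample(pos, neg)
assert len(sub_pos) == 100 and len(sub_neg) == 500   # ratio preserved
```

The subsampled features can then be passed to `umap.UMAP().fit_transform(...)` for plotting.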
Fig. 7

UMAP visualization results of training samples, where the protein sequences were formulated by the original AAIndex features (A) and mined attributes outputted from the constructed CNNs architecture (B).


Comparison of different tools and discussion

Only two state-of-the-art computational tools, nhKcr and DeepKcrot, concentrate on human nonhistone Kcr sites. Therefore, Table 5 compares the prediction performance of iKcr_CNN with these two tools. Of note, nhKcr and iKcr_CNN used the same datasets to train their models. Despite a ∼12% drop in the Sp of iKcr_CNN relative to nhKcr, the sharp increase in Sn from ∼60% to ∼77% provides greater opportunities to identify real Kcr sites, which is more crucial for future applications. DeepKcrot showed a ∼37% skew towards false Kcr sequences compared to true Kcr sites. As can be seen in Table 1, the datasets used in DeepKcrot differ from those in this work. Thus, we re-performed the independent test using their testing dataset (containing 1,483 Kcr and 16,497 non-Kcr segments), which not only checks our model's generalizability but also allows a fairer comparison between DeepKcrot and iKcr_CNN. We found that the prediction efficiency on positive samples sharply improved from 52.40% to 85.7%, a 33.3% improvement, which reflects the precise identification of true positive samples by our model. The related MCC, AUC, Pre and F1 reached 0.36, 0.87, 23.37% and 36.72%, respectively. Additionally, we calculated the 95% confidence intervals of the 5-CV experiments and listed them in the third row of Table 5: Sn = 77.13% ± 4.51%, Sp = 78.26% ± 2.46%, Acc = 78.06% ± 1.47%, MCC = 0.45 ± 0.02, AUC = 0.86 ± 0.00, etc. Taking Sn as an example, the average value is 77.13% and will fall in the range of 72.62%–81.64% with 95% probability; similarly, Sp will fall in the range of 75.80%–80.72% with 95% probability. In summary, relative to the other two tools, iKcr_CNN presents well-balanced prediction results for samples of both classes, especially for the positive samples of interest to biologists.
Therefore, it is an efficient bioinformatics tool with good robustness and generalizability, and it is expected to offer reliable guidance for future research on associated molecular mechanisms, disease treatments and drug development.
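The "±" values quoted above can be obtained as a normal-approximation 95% confidence interval over the repeated 5-CV scores, i.e. mean ± 1.96 · s/√n. A sketch with illustrative Sn values (stand-ins for the paper's actual repeated-run scores):

```python
# Sketch of a 95% confidence interval over repeated CV scores, assuming the
# normal approximation: mean +/- 1.96 * stdev / sqrt(n). The Sn values below
# are illustrative, not the paper's measured runs.
from statistics import mean, stdev

def ci95(scores):
    m = mean(scores)
    half = 1.96 * stdev(scores) / len(scores) ** 0.5
    return m - half, m + half

sn_runs = [77.1, 75.0, 79.3, 76.5, 78.0, 74.8, 79.9, 77.6, 76.2, 78.4]
lo, hi = ci95(sn_runs)
print(round(lo, 2), round(hi, 2))
```

Whether the paper's ± values use the standard error (as here) or the raw standard deviation is not stated, so this is one plausible reading.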
Table 5

Comparison of state-of-the-art tools for human nonhistone Kcr sites, where "±" indicates 95% confidence interval results.

Tools | Features | Classifier | Test | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 (%)
nhKcr | BE, AAIndex, BLOSUM62 | CNN | 5-CV | 62.86 | 90.00 | 85.40 | 0.51 | 0.88 | – | –
nhKcr | | | Inde | 58.90 | 90.00 | 84.33 | 0.48 | 0.88 | – | –
iKcr_CNN | AAIndex | CNN (FL) | 5-CV | 77.13 ± 4.51 | 78.26 ± 2.46 | 78.06 ± 1.47 | 0.45 ± 0.02 | 0.86 ± 0.00 | 41.83 ± 1.71 | 53.89 ± 1.35
iKcr_CNN | | | Inde | 81.46 | 74.23 | 75.53 | 0.45 | 0.85 | 41.01 | 54.55
DeepKcrot | WE | CNN | 5-CV | 53.70 | 90.00 | 87.10 | 0.34 | 0.86 | – | –
DeepKcrot | | | Inde | 52.40 | 90.00 | 86.90 | 0.34 | 0.86 | – | –
It should be noted that all three predictors were built on the CNNs framework. Setting aside the protein representations and hyperparameter settings, the biggest difference is the implementation of the FL function instead of CE to guide optimization. The ∼40% prediction gap between Sn and Sp in previous tools, caused by the imbalance challenge, was successfully narrowed to a negligible level. During the construction process, we tried to take many factors into account, including well-known feature extraction methods, feature incorporation, complex feature selection strategies (i.e., F-score-based feature ranking followed by incremental feature selection (IFS)), and grid search of the hyperparameters in the CNN architecture (the number of layers, number of filters, kernel size, dropout rate, learning rate, etc.; see Table 2). The evaluation metrics basically hovered around 75% and were hard to enhance remarkably. Therefore, we infer that, given an appropriate imbalance strategy, the most basic and crucial way to increase performance is to find discriminative protein representations.
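For completeness, the metrics compared throughout Tables 3-5 follow from the confusion counts. A self-contained sketch (the counts below are toy values chosen at the dataset's roughly 1:5 imbalance, not any reported result):

```python
# The evaluation metrics used in this work, computed from confusion counts.
def metrics(tp, fn, tn, fp):
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)
    pre = tp / (tp + fp)                     # precision
    f1 = 2 * pre * sn / (pre + sn)
    mcc = (tp * tn - fp * fn) / (
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return sn, sp, acc, pre, f1, mcc

# toy counts at ~1:5 imbalance (100 positives, 500 negatives)
sn, sp, acc, pre, f1, mcc = metrics(tp=80, fn=20, tn=370, fp=130)
```

Note how, at this imbalance, a high Acc can coexist with a low Pre, which is why the paper reports Sn, Sp, MCC and the PR curve rather than accuracy alone.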

Construction of the iKcr_CNN web server

The essential purpose of developing prediction models is to help biological/experimental researchers find potential modified sites. However, local implementation is cumbersome and challenging, especially configuring the DL environment of TensorFlow, Keras, etc. Thereby, we built a user-friendly web server to predict Kcr sites online (accessible at ikcrcnn.webmalab.cn/). As illustrated in Fig. 8, four underlined buttons, "Introduction", "Prediction", "Download" and "Contact", link to a brief server introduction, the Kcr prediction interface, resource downloads and contact details, respectively. When predicting potential sites (green box), the user only needs to prepare and input the query proteins in FASTA format with a fixed window length of 29 (the lysine (K) in the center with 14 residues on each side; see Section 2.1), or directly upload a FASTA file for large numbers of proteins. A few seconds after submission, the predicted results will be shown in tabular form, including the sample ID, protein sequence, predicted label (with a default threshold of 0.5) and the specific probability of being a Kcr site (abbreviated as "ID", "Sequence", "Label" and "Probability", respectively). Note that the output box can only display up to 20 samples. All prediction results and related AAIndex features can be quickly downloaded below in CSV format (↓ icon). All benchmark datasets and Python codes can be obtained on the "Download" page as well as from the public GitHub repository (github.com/lijundou/iKcr_CNN/).
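Preparing the server's input amounts to cutting 29-residue windows centered on each lysine. A minimal sketch; padding the termini with 'X' is an assumption about how short flanks are handled, not a documented server behavior:

```python
# Sketch of preparing server input: extract 29-residue windows centered on
# each lysine (K). Terminal padding with 'X' is assumed; the server expects
# fixed-length 29-mers with K at the central position (index 14).
def k_windows(sequence, flank=14):
    padded = "X" * flank + sequence + "X" * flank
    windows = []
    for i, aa in enumerate(sequence):
        if aa == "K":
            center = i + flank
            windows.append(padded[center - flank:center + flank + 1])
    return windows

wins = k_windows("MKTAYIAKQRQISFVK")   # toy sequence with three lysines
assert all(len(w) == 29 and w[14] == "K" for w in wins)
```

Each returned window can then be pasted into the "Prediction" box or written to a FASTA file for upload.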
Fig. 8

Screenshot of the iKcr_CNN web server (available at ikcrcnn.webmalab.cn/), where the user can input or upload the query proteins (upper panel) and the results can be presented and downloaded (lower panel) shortly afterwards.


Conclusion

Precise identification of Kcr sites can facilitate research progress on the involved modification mechanisms, cellular activities and medical developments. Although two state-of-the-art prediction tools have been proposed, they presented low efficiency on the positive samples that experimenters care most about. In this work, we built a novel computational tool in the deep learning framework, dubbed iKcr_CNN, to predict Kcr sites on human nonhistone proteins. To eliminate the prediction preference for major samples caused by the imbalanced distribution, we implemented the powerful focal loss function as the indicator to optimize the constructed deep learning classifier. Unlike the classical binary cross-entropy (CE), FL not only implements different class weights but also distinguishes well- and hard-classified samples to treat the positive and negative subsets fairly. Based on only 29 important physicochemical patterns in the AAIndex descriptor, this concise model ultimately achieved a 77.31% prediction score for true Kcr sites and 78.62% for false Kcr sites with an AUC of 0.86 over 5-fold CV, as well as 77.87% for true Kcr sites and 76.61% for false Kcr sites with an AUC of 0.85 over the independent tests. Compared to the previous tools, it reported the highest prediction precision on positive samples and showed more balanced performance, which indicates the high efficiency of FL on the data imbalance issue. Furthermore, we built an online web server, named iKcr_CNN, to help researchers conveniently perform Kcr detection (available at ikcrcnn.webmalab.cn/). We anticipate that the proposed model is a reliable tool to detect potential Kcr sites and provides bioinformatics guidance for further laboratory research. Although our model showed balanced prediction scores for positive and negative samples, there is still room to improve its performance.
A series of complex optimization experiments involving different protein representation approaches, feature combination/selection and classification algorithms brought only ∼5% improvement. Therefore, we believe that developing effective protein representation methods is still the fundamental way to build advanced bioinformatics tools. Undeniably, designing novel and effective representation methods is a challenging and meaningful task, which mainly depends on a profound understanding of modification mechanisms and biological sequences, combined with certain statistical knowledge. Besides, multiple mainstream DL techniques, such as autoencoders (AE), transfer learning (TL) and generative adversarial networks (GANs), are also promising directions for mining more powerful features. In addition, previous studies basically concentrated on improving the overall prediction performance during the modelling process. It is also valuable to give precise explanations for specific examples that are detected by the FL-based model but missed by other methods; this may bring new insights into the differences between Kcr and non-Kcr samples at the sequence level, and even help uncover the modification mechanisms. Furthermore, it is highly encouraged to implement interpretable algorithms, such as SHapley Additive exPlanations (SHAP) [20] and Local Interpretable Model-agnostic Explanations (LIME) [22], to help biologists understand the model and analyze the contributions of individual features to the predicted results. The aspects mentioned above will be important directions for us to enhance model performance and conduct more in-depth research in the future.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (29 in total)

1.  MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization.

Authors:  Duolin Wang; Dongpeng Liu; Jiakang Yuchi; Fei He; Yuexu Jiang; Siteng Cai; Jingyi Li; Dong Xu
Journal:  Nucleic Acids Res       Date:  2020-07-02       Impact factor: 16.971

2.  Long short-term memory.

Authors:  S Hochreiter; J Schmidhuber
Journal:  Neural Comput       Date:  1997-11-15       Impact factor: 2.026

3.  Sequence representation approaches for sequence-based protein prediction tasks that use deep learning.

Authors:  Feifei Cui; Zilong Zhang; Quan Zou
Journal:  Brief Funct Genomics       Date:  2021-03-02       Impact factor: 4.241

4.  Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method.

Authors:  Hao Lv; Fu-Ying Dao; Zheng-Xing Guan; Hui Yang; Yan-Wen Li; Hao Lin
Journal:  Brief Bioinform       Date:  2021-07-20       Impact factor: 11.622

5.  BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches.

Authors:  Bin Liu; Xin Gao; Hanyu Zhang
Journal:  Nucleic Acids Res       Date:  2019-11-18       Impact factor: 16.971

6.  nhKcr: a new bioinformatics tool for predicting crotonylation sites on human nonhistone proteins based on deep learning.

Authors:  Yong-Zi Chen; Zhuo-Zhi Wang; Yanan Wang; Guoguang Ying; Zhen Chen; Jiangning Song
Journal:  Brief Bioinform       Date:  2021-11-05       Impact factor: 11.622

7.  Hierarchical classification of protein folds using a novel ensemble classifier.

Authors:  Chen Lin; Ying Zou; Ji Qin; Xiangrong Liu; Yi Jiang; Caihuan Ke; Quan Zou
Journal:  PLoS One       Date:  2013-02-20       Impact factor: 3.240

8.  Characterization and identification of lysine crotonylation sites based on machine learning method on both plant and mammalian.

Authors:  Rulan Wang; Zhuo Wang; Hongfei Wang; Yuxuan Pang; Tzong-Yi Lee
Journal:  Sci Rep       Date:  2020-11-24       Impact factor: 4.379

9.  CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors:  Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal:  Bioinformatics       Date:  2012-10-11       Impact factor: 6.937

10.  Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications.

Authors:  Zitao Song; Daiyun Huang; Bowen Song; Kunqi Chen; Yiyou Song; Gang Liu; Jionglong Su; João Pedro de Magalhães; Daniel J Rigden; Jia Meng
Journal:  Nat Commun       Date:  2021-06-29       Impact factor: 14.919
