
Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation.

Chun Li1,2,3, Jialing Zhao2, Changzhong Wang2, Yuhua Yao1.   

Abstract

AIM AND OBJECTIVE: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for the timely encoding of protein sequences and the extraction of their hidden information.
METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically.
RESULTS: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M.
CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

Keywords:  Adjacency matrix; Generalized PseAAC; graph; identification of DNA-binding proteins; phylogenetic analysis; protein sequences.

Year:  2018        PMID: 29380690      PMCID: PMC5930480          DOI: 10.2174/1386207321666180130100838

Source DB:  PubMed          Journal:  Comb Chem High Throughput Screen        ISSN: 1386-2073            Impact factor:   1.339


INTRODUCTION

DNA-binding proteins (DNA-BPs) are very important functional proteins in a cell. These proteins play vital roles in various cellular processes, including DNA replication, transcription, regulation of gene expression, packaging, and other activities associated with DNA [1-5]. It is therefore important to distinguish DNA-BPs from non-DNA-binding proteins (NBPs). In the past, many experimental and computational techniques have been developed for identifying DNA-BPs. Experimental techniques can provide a clear-cut answer for a query protein. However, they are costly and time-consuming, and thus impractical for large datasets [3-7]. Computational methods can be broadly divided into two categories: structure-based methods and sequence-based methods. The former can discriminate DNA-binding from non-binding proteins with high accuracy, but they cannot be employed in high-throughput annotation, as they require the structure information of a query protein [1]. Though tremendous progress has been achieved in the experimental determination of protein structures in the past five decades, it cannot keep pace with the explosive growth of sequence information resulting from modern sequencing technology [8]. Yet as suggested by Anfinsen [9], proteins contain within their amino acid sequences enough information to determine their native conformation. Therefore, it is more promising to use sequence-based methods to identify DNA-BPs. One of the core issues for sequence-based methods is how to characterize protein sequences and extract the information hidden in them. The most typical approach is to use the amino acid composition (AAC) to formulate a protein sequence. Owing to its simplicity, the AAC model was widely applied in a number of earlier statistics-based methods.
However, as pointed out in Ref [6], if we denote by $n_1, n_2, \ldots, n_{20}$ the counts of the 20 standard amino acids in a protein sequence, then there are a total of $\frac{(n_1+n_2+\cdots+n_{20})!}{n_1!\,n_2!\cdots n_{20}!}$ different sequences/strings possessing the same AAC. The reason is that the AAC model neglects the order relation among the elements of a sequence. To overcome this drawback, the concept of pseudo amino acid composition (PseAAC, or Chou’s PseAAC) was proposed [10-18]. The essence of PseAAC is that it not only covers AAC, but also contains additional order-correlated factors along a protein sequence. Another popular way of sequence analysis is to convert the protein primary sequence over 20 amino acids into a reduced one. The earliest and simplest reduction was the well-known HP model, in which the 20 standard amino acids are divided into two types, hydrophobic (H) (or non-polar) and polar (P) (or hydrophilic). On the basis of this classic model, a detailed HP model was introduced by dividing the polar class into three subclasses: positive polar, uncharged polar and negative polar [19]. In addition, a few five-group classifications of amino acids were presented for practical purposes [20-23]. By considering property-based triples, Li et al. [6] put forward a six-letter model of amino acids. Also based on three physicochemical properties of amino acids, Yao et al. [24] mapped the 20 standard amino acids to the eight vertices of a cube centered at the origin, and thus obtained an eight-group model of amino acids. Motivated by the work mentioned above, we propose a generalized PseAAC which is grounded on a three-letter model and a 2-D graphical representation of a protein sequence. We summarize the main work of this paper as follows: In section 2, we briefly introduce the five datasets used in this study. In section 3, on the basis of two important physicochemical properties of amino acids, we cluster the 20 standard amino acids into three groups.
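To make the counting argument concrete, the number of distinct strings that share one amino acid composition is a multinomial coefficient. The following is a minimal sketch; the function name `sequences_with_same_aac` is our own illustrative choice and does not come from the cited work.

```python
from collections import Counter
from math import factorial


def sequences_with_same_aac(seq: str) -> int:
    """Count the distinct strings having the same letter composition as `seq`.

    This is the multinomial coefficient N! / (n_1! * n_2! * ... ), where N is
    the sequence length and n_i are the per-letter counts.
    """
    total = factorial(len(seq))
    for count in Counter(seq).values():
        # Each partial quotient is itself an integer, so // is exact here.
        total //= factorial(count)
    return total


if __name__ == "__main__":
    # A 21-residue segment already admits a very large number of rearrangements.
    print(sequences_with_same_aac("EKAAVTGFWGKVKVDEVGAEA"))
```

Even short segments share their AAC with a very large number of other strings, which is why order-aware descriptors are needed.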
By assigning to each group a representative symbol, we transform a protein sequence into a three-letter sequence. Then a 2-D graph without loops and multiple edges and its geometric line adjacency matrix are obtained. A sequence-derived feature vector of dimension (25+λ) is thus constructed to characterize a protein sequence. Our scheme is similar in form to, but distinct from, that of PseAAC. In section 4, we apply the presented feature vector to compare β-globin proteins of 17 species and 72 spike proteins of coronaviruses, respectively. Also, we develop an SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. Experimental results show that the presented method outperforms the existing methods including DNAbinder [1], DNA-Prot [2], iDNA-Prot [3] and enDNA-Prot [4]. Finally, conclusions are given in section 5.

Datasets

In this study, the following five datasets are used. For convenience, they are denoted by BetaSet, CoVSet, DNASet, DNAeSet and DNAiSet, respectively.

BetaSet

The dataset called BetaSet is composed of the β-globin proteins of 17 species: Human (ALU64020), Gorilla (P02024), Chimpanzee (P68873), Cattle (CAA25111), Banteng (BAJ05126), Goat (AAA30913), Sheep (ABC86525), European hare (CAA68429), Rabbit (CAA24251), House mouse (ADD52660), Western wild mouse (ACY03394), Spiny mouse (ACY03377), Norway rat (CAA29887), Opossum (AAA30976), Guttata (ACH46399), Gallus (CAA23700), Muscovy duck (CAA33756). This dataset is used to determine the adjustable parameters in a feature vector.

CoVSet

This dataset consists of 72 spike proteins of coronaviruses (CoVs), 23 of which are MERS-CoVs, and 30 are SARS-CoVs. CoVs can be divided into three groups according to serotypes. Group alpha (formerly known as CoV-1) and group beta (formerly CoV-2) contain mammalian viruses, while group gamma (formerly CoV-3) contains only avian viruses. The name, accession number, and abbreviation of the 72 sequences are listed in Table 1. According to the existing taxonomic groups, sequences 1-5 belong to the first group, sequences 6-8 belong to the third group, and the remaining sequences belong to the second group.

DNASet

This is a benchmark dataset created in 2007 by Kumar et al. [1]. It contains 396 sequences, 146 of which are DNA-BPs (positive samples), and 250 NBPs (negative samples). In both the positive and the negative sets, the sequence similarity between any two proteins is not more than 25%.

DNAiSet

This dataset was also generated by Kumar et al. [1], based on the work of Wang and Brown [25]. It originally contained 92 DNA-BPs and 100 NBPs. In order to avoid overestimating a given method, those sequences having sequence similarity with DNASet were removed by Xu et al. [4], and the final dataset is composed of 82 DNA-BPs and 100 NBPs.

DNAeSet

As an expanded benchmark dataset, DNAeSet was constructed in 2014 by Xu et al. [4]. Using a sequence filter criterion identical to that of DNASet, they added a number of NBPs to DNASet, bringing the total number of NBPs to 2125. After removing the sequences that have sequence identity with DNAiSet, the current version of DNAeSet has 146 DNA-BPs and 1710 NBPs.

Methods

Three-letter Sequence of Protein Sequence and its 2-D Graphical Representation

Isoelectric point (pI) and relative distance (RD) are two important physicochemical properties of the 20 standard amino acids [26-28]. Their original numerical values are listed in Table 2. As can be seen from this table, the values of $P_1^0$ (isoelectric point) are in the range [2.97, 10.76], while $P_2^0$ (relative distance) varies between 1469 and 3355. Therefore, normalization of these values is needed. Here, we scale them into the interval [0,1] by the formula below:

$$P_j(a_i) = \frac{P_j^0(a_i) - \min_{k} P_j^0(a_k)}{\max_{k} P_j^0(a_k) - \min_{k} P_j^0(a_k)}, \quad j = 1, 2, \quad i = 1, 2, \ldots, 20. \tag{1}$$

The corresponding values are listed in Table 3. The last row in this table gives the average values. For the i-th amino acid $a_i$, if $P_1(a_i)$ is above the average value of $P_1$, then we label it “+”; otherwise we label it “-”. Similarly, if property $P_2$ is considered, the second label for amino acid $a_i$ can be obtained. In this way, each of the 20 standard amino acids has a label pair. In Table 3, the corresponding labels are also listed. Amino acids with the same label pair are viewed as members of the same group. Thus, the 20 standard amino acids are distributed into the following groups: GI={ A,Y,V,Q,M,L,I,E }, GII={ C,W,S,N,G,F,D }, GIII={ H,T,R,P,K }. For each group, the first amino acid is used to stand for the group, so the three groups have the representative letters A, C and H, respectively. The value of a property for a group is defined as the average value of that property over all members of the group. In the left-hand side of Table 4, we list the corresponding values of the three groups. Each group can thus be viewed as a 2-D vector. We further normalize these three vectors to unit length, and list the normalized values in the right-hand side of Table 4. In Fig. (1), we show the 2-D map of the 20 standard amino acids according to the classification above.
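The grouping step can be checked directly against the raw values of Table 2. The sketch below assumes min-max scaling into [0,1] and a strict "above the average" rule for the "+" label; both are our reading of the text, and the function names are illustrative only. Under these assumptions the three label pairs (-,-), (-,+) and (+,-) reproduce GI, GII and GIII.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Raw values from Table 2, in the order of AMINO_ACIDS.
PI = [6.02, 5.02, 2.97, 3.22, 5.48, 5.97, 7.59, 6.02, 9.74, 5.98,
      5.75, 5.42, 6.30, 5.65, 10.76, 5.68, 6.53, 5.97, 5.89, 5.66]
RD = [1889, 3355, 2209, 1812, 1916, 2078, 1507, 1765, 1797, 1822,
      1689, 1943, 1720, 1538, 1697, 2000, 1469, 1680, 2317, 1787]


def min_max(values):
    """Scale values into [0, 1] (assumed form of the normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def labels(values):
    """Label each scaled value '+' if it lies above the average, else '-'."""
    scaled = min_max(values)
    mean = sum(scaled) / len(scaled)
    return ["+" if v > mean else "-" for v in scaled]


def group_by_label_pair():
    """Collect amino acids that share the same (pI label, RD label) pair."""
    groups = {}
    for aa, l1, l2 in zip(AMINO_ACIDS, labels(PI), labels(RD)):
        groups.setdefault((l1, l2), set()).add(aa)
    return groups


GROUPS = group_by_label_pair()

if __name__ == "__main__":
    for pair, members in sorted(GROUPS.items()):
        print(pair, "".join(sorted(members)))
```

Note that no amino acid receives the pair (+,+), which is why only three groups arise from two binary labels.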
Fig. (1)

The 2-D map of the 20 standard amino acids.

By substituting each amino acid with its representative letter, a protein primary sequence is reduced to a three-letter sequence. For example, the three-letter sequence of the sequence segment EKAAVTGFWGKVKVDEVGAEA is AHAAAHCCCCHAHACAACAAA. To obtain the graphical representation of a reduced sequence, we start from the origin (0,0) and move in the xoy-plane in the direction dictated by Fig. (1). Mathematically, let $S = s_1 s_2 \cdots s_N$ be a given three-letter sequence. Then one has a map $\phi$, which maps S into a plot set. Explicitly, $\phi(S) = \{(x_0, y_0), (x_1, y_1), \ldots, (x_N, y_N)\}$ with $(x_0, y_0) = (0, 0)$, and $(x_i, y_i)$ is given by

$$(x_i, y_i)^T = (x_{i-1}, y_{i-1})^T + (u_1(s_i), u_2(s_i))^T, \quad i = 1, 2, \ldots, N,$$

where T represents the transpose of a matrix, and $u_j(s_i)$ (j=1,2) represents the j-th component of the unit vector corresponding to $s_i$ (cf. Fig. (1) and Table 4). Connecting all points of the plot set in turn, a 2-D curve is drawn. In Fig. (2), we show the 2-D graphical representation of the sequence AHAAAHCCCCHAHACAACAAA. It is not difficult to see that the 2-D graphical representation has no degeneracy, and thus is a simple graph, that is, a graph without loops and multiple edges.
Fig. (2)

The 2-D graphical representation.
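The reduction and the walk in the plane can be sketched as follows. The group membership is taken from the text; the direction vectors in `DEMO_VECTORS` are hypothetical placeholders chosen for readability and are not the normalized group vectors of the paper, which should be substituted when reproducing the figure.

```python
# Group membership as stated in the text; keys are the representative letters.
GROUP_MEMBERS = {"A": "AYVQMLIE", "C": "CWSNGFD", "H": "HTRPK"}
LETTER_OF = {aa: rep for rep, members in GROUP_MEMBERS.items() for aa in members}

# Hypothetical unit vectors for illustration only (NOT the paper's values).
DEMO_VECTORS = {"A": (1.0, 0.0), "C": (0.0, 1.0), "H": (0.6, 0.8)}


def reduce_sequence(protein: str) -> str:
    """Replace every amino acid by the representative letter of its group."""
    return "".join(LETTER_OF[aa] for aa in protein)


def walk(reduced: str, unit_vectors=DEMO_VECTORS):
    """Return the plot set: start at the origin, add one unit vector per letter."""
    points = [(0.0, 0.0)]
    for letter in reduced:
        dx, dy = unit_vectors[letter]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points


if __name__ == "__main__":
    reduced = reduce_sequence("EKAAVTGFWGKVKVDEVGAEA")
    print(reduced)
    print(walk(reduced)[:4])
```

Because every step adds a vector with non-negative components, the curve never revisits a point, which is consistent with the absence of degeneracy noted above.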

(25 + λ) Dimensional Feature Vector

In this section, we give a numerical characterization of a protein sequence that will facilitate quantitative comparisons of protein sequences. As is known, once a graphical representation is given, it can be transformed into some structural matrices, such as the matrices ED, GD, M/M, and L/L [6, 24, 29-37]. Here we employ the L/L matrix. L/L is a nonnegative symmetric matrix whose off-diagonal entries are defined as the quotient of the Euclidean distance between two vertices of the graph and the sum of the geometrical lengths of the edges between the two vertices. By definition, all diagonal elements are zero. Obviously, the entries of an L/L matrix are less than or equal to one. The higher-order matrix ${}^{k}L/{}^{k}L$ is the matrix whose (i,j)-entry is $([L/L]_{ij})^k$. As the exponent k approaches positive infinity, ${}^{k}L/{}^{k}L$ converges to a (0,1) matrix (denoted by ${}^{b}L/{}^{b}L$). With respect to the proposed 2-D graph, $[{}^{b}L/{}^{b}L]_{ij}=1$ if and only if the two corresponding vertices lie on a straight line in the curve, including the cases of adjacency and non-adjacency. In this sense, we call such a matrix a geometric line adjacency matrix (GLAM), or simply a generalized adjacency matrix (GAM), generated by a graph. The first Zagreb index is a well-known vertex-degree-based molecular structure descriptor. This index was first considered by Gutman and Trinajstić about 45 years ago, and has since been discussed and used in numerous studies (see [38-40] and the references cited therein). The first Zagreb index is defined as

$$Zg_1(G) = \sum_{u \in V(G)} d_u^2, \tag{2}$$

where $d_u$ denotes the degree (=number of first neighbors) of the vertex u in graph G. If G is a simple graph (i.e. without loops and multiple edges), $Zg_1$ can also be obtained directly from its adjacency matrix, since the row sums of this matrix are equal to the degrees of the corresponding vertices. It should be mentioned that the Zagreb index gives greater weights to inner vertices and edges than to outer vertices and edges of a graph [38].
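The L/L matrix and its (0,1) limit can be computed directly from the plot set of a curve. The sketch below is a plain-Python illustration under the definitions just given; the function names and the numerical tolerance `tol` are our own assumptions.

```python
from math import dist, isclose


def ll_matrix(points):
    """L/L matrix: Euclidean distance divided by path length along the curve."""
    n = len(points)
    # cum[i] is the length of the curve from the first vertex up to vertex i.
    cum = [0.0]
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + dist(p, q))
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = dist(points[i], points[j]) / (cum[j] - cum[i])
            matrix[i][j] = matrix[j][i] = ratio
    return matrix


def glam(points, tol=1e-9):
    """Limit of the k-th entrywise power: 1 where the L/L entry equals one."""
    return [[1 if isclose(v, 1.0, abs_tol=tol) else 0 for v in row]
            for row in ll_matrix(points)]


if __name__ == "__main__":
    curve = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
    for row in glam(curve):
        print(row)
```

In the small example, vertices 0 and 2 are not adjacent but lie on one straight segment, so their entry is 1; this is the sense in which the matrix generalizes ordinary adjacency.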
One way to amend this is to insert inverse values of the vertex degree into Eq (2), and thus the modified Zagreb index has been proposed [38]:

$${}^{m}Zg_1(G) = \sum_{u \in V(G)} \frac{1}{d_u^2}.$$

Clearly, ${}^{m}Zg_1$ gives greater weights to outer vertices/edges than to inner ones in a graph. At the same time, on the basis of our geometric line adjacency matrix, we can count the vertex pairs with a generalized adjacency relationship. It should be noted that, in our case, the 'neighbors' include not only the conventional neighbors, i.e. the first neighbors, but also the second neighbors, the third neighbors, and so on. We call the corresponding number of graph G a line-adjacency index, and denote it by La(G). Then we have a graph-based index. For a symmetric matrix, eigenvalue-based indices, such as the leading eigenvalue [29-33, 35] and the graph energy [17], are often used as matrix invariants. Moreover, in our previous paper [41], an alternative invariant called the ‘ALE-index’ was proposed. The ALE-index is defined by the following formula:

$$\mathrm{ALE}(M) = \frac{1}{2}\left(\frac{1}{L}\,\|M\|_{m_1} + \sqrt{\frac{L-1}{L}}\,\|M\|_{F}\right), \tag{4}$$

where L is the order of the matrix M, and $\|\cdot\|_{m_1}$ and $\|\cdot\|_{F}$ are the m1- and F-norms of a matrix, respectively. In order to reduce variations caused by comparing matrices of different sizes, we consider a normalized ALE-index instead of the ALE-index itself, and refer to it as the matrix-based index. In addition, with respect to the three-letter sequence S, we define a coupling mode function (Eq (5), n=1, 2), where P1 and P2 are the values of the properties of the corresponding representative letter (group), and the integer k represents the counted rank (or tier) of the coupling mode. Then, following procedures similar to those in [10, 11], we can extract global sequence-order information of the three-letter sequence S through a set of factors (Eq (6)), the k-th of which is called the k-th tier correlation factor. Clearly, the first-tier factor reflects the coupling mode between the most contiguous elements along the three-letter sequence S, the second-tier factor that between the second most contiguous elements, the third-tier factor that between the third most contiguous, and so forth.
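For a simple graph, both Zagreb-type indices follow from the row sums of a (0,1) adjacency matrix. The following minimal sketch uses the standard definitions (sum of squared degrees, and sum of inverse squared degrees); the function names are illustrative, and skipping isolated vertices in the modified index is our own convention to avoid division by zero.

```python
def degrees(adjacency):
    """Vertex degrees of a simple graph are the row sums of its adjacency matrix."""
    return [sum(row) for row in adjacency]


def first_zagreb(adjacency):
    """First Zagreb index: sum of squared vertex degrees."""
    return sum(d * d for d in degrees(adjacency))


def modified_first_zagreb(adjacency):
    """Modified first Zagreb index: sum of inverse squared vertex degrees.

    Isolated vertices (degree 0) are skipped; this is an assumption of the sketch.
    """
    return sum(1.0 / (d * d) for d in degrees(adjacency) if d > 0)


if __name__ == "__main__":
    # Path graph on four vertices: degrees are 1, 2, 2, 1.
    path4 = [[0, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 0, 1],
             [0, 0, 1, 0]]
    print(first_zagreb(path4), modified_first_zagreb(path4))
```

On the path graph the two inner vertices dominate the first index while the two end vertices dominate the modified one, which illustrates the complementary weighting described in the text.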
Furthermore, from the respective counts of the three representative letters (A, C and H) in sequence S, we can obtain a so-called group composition (GC), whose definition also involves the size of each group (set). Consequently, (5+λ) elements are derived, which reflect the information about the reduced sequence and, particularly, the 2-D graphical representation. By combining these elements with the conventional amino acid composition (AAC), a (25+λ) dimensional feature vector can be constructed to numerically characterize a protein sequence (Eqs (7) and (8)). Here, the AAC part consists of the frequencies of occurrence of the 20 standard amino acids in a protein sequence, and w1, w2 and w3 are weight factors. As will be described later in detail, the four adjustable parameters in Eqs (7) and (8), namely λ, w1, w2 and w3, can be determined by a set of known samples. Roughly speaking, the vector contains the feature of AAC, and the information beyond AAC as well, which is similar to Chou’s PseAAC in form. Therefore, we call such a vector formulated by Eqs (7) and (8) the generalized PseAAC of a protein sequence.
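As an illustration of how tier correlation factors capture sequence order, the sketch below follows the general recipe of Chou's original PseAAC: the k-th factor averages a coupling value over all pairs of positions that are k apart. The coupling used here (mean squared difference of the property values) and the property table `DEMO_PROPS` are assumptions for illustration and are not claimed to be the exact form of Eqs (5) and (6).

```python
# Hypothetical property values per representative letter (illustration only).
DEMO_PROPS = {"A": (0.0, 0.0), "C": (1.0, 0.0), "H": (0.0, 1.0)}


def coupling(a, b):
    """Assumed coupling mode: mean squared difference over the properties."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tier_correlation(reduced, max_tier, props=DEMO_PROPS):
    """Return the 1st..max_tier-th correlation factors of a reduced sequence.

    Requires len(reduced) > max_tier so that every tier has at least one pair.
    """
    n = len(reduced)
    factors = []
    for k in range(1, max_tier + 1):
        total = sum(coupling(props[reduced[i]], props[reduced[i + k]])
                    for i in range(n - k))
        factors.append(total / (n - k))
    return factors


if __name__ == "__main__":
    print(tier_correlation("AHAAAHCCCCHAHACAACAAA", max_tier=6))
```

Two sequences with identical letter counts but different orderings generally yield different factors, which is the information that plain composition cannot provide.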

Results and Discussion

In this section, we will discuss the use of the generalized PseAAC. As can be seen from Eqs (7) and (8), the present mathematical descriptor contains four uncertain parameters: λ, w1, w2 and w3. Here λ represents the total number of correlation ranks counted (cf. Eq (6)), which is an integer. Generally speaking, the greater the value of λ, the more sequence-order effects will be incorporated. However, if the value is too large, it might cause the overfitting problem or ‘high dimension disaster’ [15]. Therefore, we endeavour to limit the value of λ to a small integer. In this study, the five datasets (BetaSet, CoVSet, DNASet, DNAeSet and DNAiSet) are arranged into two groups: one contains BetaSet, the other includes the rest. The first group is used for determining the four adjustable parameters, and the second group for testing purposes.

Parameter Determination

According to the method mentioned above, we first associate each of the 17 protein sequences in BetaSet with a (25+λ)-dimensional vector (cf. Eqs (7) and (8)), and then calculate the pairwise Euclidean distance between any two of the 17 protein sequences via their vectors. Thus a 17 × 17 real symmetric distance matrix is obtained. On the basis of this distance matrix, a UPGMA tree is constructed using the MEGA4 package. The result depends on the values of the rank λ and the three weight factors. It is found that, for λ = 6 and a suitable choice of w1, w2 and w3, the three non-mammals (Muscovy duck, Gallus and Guttata) form a separate branch and stay outside of the mammals. Moreover, in the subtree of mammals, the primate species (Human, Chimpanzee, Gorilla) are grouped closely. Also, the rodent species (Norway rat, Spiny mouse, House mouse, Western wild mouse) and the lagomorph species (Rabbit, European hare) are situated at independent branches, respectively, while Goat, Sheep, Cattle and Banteng cluster together (Fig. 3). This result is analogous to that reported in the literature [6, 29, 30, 35, 36]. Accordingly, these four numerical values are used for the four uncertain parameters, and a 31-D feature vector is thus obtained.
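The tree-building step can be illustrated without any external package. The sketch below computes a Euclidean distance matrix and runs a plain UPGMA agglomeration that returns the topology as nested tuples; it is a simplified stand-in for the MEGA4 workflow (no branch lengths, arbitrary tie-breaking), and the toy vectors are invented for the example.

```python
from math import dist


def distance_matrix(vectors):
    """Pairwise Euclidean distances between feature vectors."""
    return [[dist(u, v) for v in vectors] for u in vectors]


def upgma(matrix, names):
    """Return the UPGMA tree topology as nested tuples of leaf names."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {(i, j): matrix[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters (a < b)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps the distance an unweighted mean over leaves.
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    toy = [(0.0,), (1.0,), (10.0,), (11.0,)]  # invented 1-D feature vectors
    print(upgma(distance_matrix(toy), ["a", "b", "c", "d"]))
```

In practice the 17 (or 72) generalized PseAAC vectors would take the place of the toy vectors, and the resulting topology can be compared with the known taxonomy.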

Test I: Phylogenetic Analysis of Coronavirus Spike Proteins

In order to evaluate the effectiveness of our method, we test it by phylogenetic analysis on the CoVSet dataset. Coronaviruses (CoVs) belong to the genus Coronavirus of family Coronaviridae [42]. The first coronavirus (HCoV-229E) was isolated from humans in 1965. Until 2003, coronaviruses attracted little interest beyond causing mild upper respiratory tract infections. However, this situation changed dramatically with the emergence of SARS-CoV and MERS-CoV. As of July 2017, 2040 laboratory-confirmed cases of MERS-CoV infection had been reported in over 27 countries, and at least 710 individuals had died (crude CFR 34.8%) [43]. Using the above-determined values for parameters λ, w1, w2, and w3, we calculate the 31-D feature vectors of the 72 coronavirus spike proteins and their Euclidean distance matrix; then the corresponding phylogenetic tree (Fig. 4) is constructed. Observing Fig. (4), we find that the 72 coronavirus spike proteins are clustered into three groups: one contains the five alpha coronaviruses (PEDVC, PEDV, TGEVG, TGEV, and HCoV-229E), the second includes the three gamma coronaviruses (IBV, IBVBJ, IBVC), and the third corresponds to the group beta. A closer look at the subtree of beta coronaviruses shows that the MERS-CoVs are clearly clustered together, as are the SARS-CoVs, while MHV, MHVA, MHVM, MHVP, MHVJHM, BCoV, BCoVE, BCoVL, BCoVM, BCoVQ and HCoV-OC43 are situated at an independent branch. The resulting clusters agree well with the established taxonomic groups.
Fig. (4)

The relationship tree of 72 coronavirus spike proteins.

Test II: Identification of DNA-binding Proteins

To further assess the effectiveness of the proposed method, we conduct a series of experiments on the identification of DNA-binding proteins using three datasets: DNASet, DNAeSet and DNAiSet. Among them, DNASet and DNAeSet serve as training datasets, while DNAiSet serves as an independent testing dataset. Support vector machine (SVM) is employed as the classifier, and the R package ‘e1071’ v1.6-8 [44] is used to implement SVM. For a given set of binary-labeled training examples, SVM maps the input space into a higher-dimensional space and seeks a hyperplane to separate the positive samples from the negative ones [25]. The optimal hyperplane maximizes the separation margin between the two classes of training data. The distance measurement between the data points in the high-dimensional space is defined by the kernel function. In this study, we use the radial basis function (RBF) kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$. This model involves two tunable parameters: the kernel width γ and the penalty parameter C. Prediction performance can be assessed using some quality indices including Accuracy (ACC), Sensitivity (Se), Specificity (Sp), F-measure (F1M) and Matthews correlation coefficient (MCC) [2, 4, 5, 25, 37, 45]:

$$\mathrm{ACC} = \frac{TP+TN}{TP+TN+FP+FN}, \quad \mathrm{Se} = \frac{TP}{TP+FN}, \quad \mathrm{Sp} = \frac{TN}{TN+FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \quad P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad \mathrm{F1M} = \frac{2PR}{P+R}, \tag{9}$$

where TP, TN, FP, and FN are defined as the numbers of true positive, true negative, false positive, and false negative samples obtained from the prediction, respectively, while P and R denote the Precision value and the Recall value, respectively. One can also use the alternative formulation adopted in a series of recently published studies [15, 46-48]. The higher the values of these measurements, the better the quality of prediction.
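The quality indices can be computed from the four confusion counts alone. The sketch below uses the standard definitions of ACC, Se, Sp, F1M and MCC together with a plain RBF kernel; the function names are our own. As a check, it is fed the counts reported for the independent test on DNAiSet (68 of 82 DNA-BPs and 92 of 100 NBPs predicted correctly).

```python
from math import exp, sqrt


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def metrics(tp, tn, fp, fn):
    """Standard quality indices from the confusion counts."""
    se = tp / (tp + fn)              # sensitivity, identical to recall
    sp = tn / (tn + fp)              # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * se / (precision + se)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "Se": se, "Sp": sp, "F1M": f1, "MCC": mcc}


if __name__ == "__main__":
    # 82 DNA-BPs (68 found) and 100 NBPs (92 found) in DNAiSet.
    for name, value in metrics(tp=68, tn=92, fp=8, fn=14).items():
        print(f"{name}: {value:.4f}")
```

With these counts the sketch recovers the reported ACC, sensitivity, specificity and MCC to the stated precision, which supports reading Eq (9) as the standard definitions.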

Predictive Performance on Benchmark Dataset

This experiment is carried out on DNASet itself. To obtain a reliable result with little error, the SVM model on DNASet is established by 5-fold cross-validation (5CV) with 3 runs. Here the 31-D feature vector of a protein sequence serves as the input for SVM. In a 5CV, the positive and negative samples are randomly distributed into five subsets, the so-called folds, and the test is repeated five times. In each of the five iterations, one subset is used as the testing set, while the remaining four subsets are combined and used to build a classifier (training). The predictions made for the test data instances in all five iterations yield the final result. The sensitivity, specificity, ACC, MCC and F1M are calculated for each run, and the corresponding results and their average values are listed in Table 5. As can be seen from this table, we achieve an accuracy (ACC) of 89.65%, with MCC of 0.776 and F1M of 84.91%. This result shows that our SVM model performs well on the benchmark dataset DNASet.
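The 5CV protocol with repeated runs amounts to generating index splits. The following is a minimal sketch in plain Python; the seeding scheme (one seed per run) and the function names are assumptions of the example, and no stratification by class is applied.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_samples, k=5, runs=3):
    """Yield (run, train_indices, test_indices) for every fold of every run."""
    for run in range(runs):
        folds = k_fold_indices(n_samples, k, seed=run)
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield run, train, test


if __name__ == "__main__":
    # DNASet has 396 sequences, so each run yields folds of size 80 or 79.
    for run, train, test in cross_validation_splits(396):
        print(run, len(train), len(test))
```

Each sample appears in exactly one test fold per run, so the predictions collected over the five iterations cover the whole dataset once.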

Predictive Performance on Blind Dataset

It is important to examine the performance of the newly developed method on an independent dataset. In this experiment, we establish the classifier with the benchmark dataset DNASet and then test it on the independent dataset DNAiSet. To decide the parameter pair (γ, C), we utilize a systematic grid search over γ and C, indexed by integers i and j in the ranges [-3, 3] and [0, 3], respectively, and keep the pair that is optimal for DNASet. With the best pair (γ, C), DNAiSet is fed to the SVM. As a result, our model correctly predicts 68 out of 82 DNA-BPs and 92 out of 100 NBPs. The ACC reaches 87.91%, with MCC, sensitivity, specificity, and F1M of 0.756, 82.93%, 92.00% and 86.07%, respectively (see Table 6). This demonstrates that our SVM model performs equally well on an independent dataset. For convenience of comparison, the results of some existing methods including DNAbinder [1], DNA-Prot [2], iDNA-Prot [3] and enDNA-Prot [4] are also listed in Table 6. DNAbinder, developed by Kumar et al. [1], extracts evolutionary information in the form of a position specific scoring matrix (PSSM) from the corresponding protein sequence. PSSM-21 and PSSM-400 are two feature vectors generated by means of PSSM, whose dimensions are 21 and 400, respectively. In [1], the PSSM-400 based SVM model was mainly used for predicting DNA-BPs. DNA-Prot [2] is a Random Forest based method, in which the feature vector includes sequence information and structure information, such as the composition of the 20 standard amino acids, the composition of 10 amino acid groups, and secondary structure information predicted from a protein sequence. iDNA-Prot [3] constructs the feature vector via the grey model, and Random Forest is also used as the operation engine. enDNA-Prot [4] is a predictor which encodes a protein sequence into a feature vector of dimension 188 and adopts an ensemble classifier constructed from four types of machine learning classifiers.
All these methods are tested on the same datasets to make an unbiased comparison with our method. Observing Table 6, we can see that the current approach outperforms the other methods by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. This result indicates that our method achieves highly competitive performance.

Impact of the Number of Negative Samples

When the number of positive samples is comparable to that of negative samples, many machine learning algorithms tend to perform better. However, in real life, the number of non-binding proteins is much greater than that of DNA-BPs (Eq (10)). In this case, the frequency of NBPs is generally much greater than that of the binding ones in the predictions (Eq (11)). As a consequence of Eqs (10) and (11), the value of ACC defined by Eq (9) tends towards 1. To solve this problem, instead of using the definition of ACC in Eq (9), here we use the alternative definition of [49, 50], denoted by acc (Eq (12)). In order to analyze the influence of the number of negative samples in a benchmark dataset on the predictive performance of the current method, we construct a series of subsets of DNAeSet and use them as the training set in turn, while DNAiSet is always used as the testing set. Each subset contains all the 146 DNA-BPs and a part of the NBPs in DNAeSet. In detail, if the set of NBPs in the k-th subset is denoted by $N_k$, k=1, 2, ..., then $N_1$ consists of 250 NBPs randomly selected from DNAeSet, and $N_{k+1}$ is obtained by adding 50 NBPs to $N_k$, until 1700 NBPs are contained in it. For each subset, k=1, 2, ..., 30, we develop the SVM model by 5CV with 3 runs. The results averaged over the three runs are given in Fig. (5). From Fig. (5) we can see that the curves of ACC and acc visibly separate from each other as n, the size of $N_k$, grows. With increasing n, ACC increases rapidly, while acc tends to be steady. ACC thus appears to grow ever higher, but this is an artifact of the class imbalance and does not correctly reflect the performance.
Fig. (5)

The influence of the number of negative samples.
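The series of nested negative sets can be generated by shuffling the NBP indices once and taking growing prefixes. This is a minimal sketch under the assumption that the 50 NBPs added at each step are drawn at random from those not yet used; the function name and the seed are illustrative.

```python
import random


def nested_negative_subsets(n_total=1710, first=250, step=50, last=1700, seed=0):
    """Return index lists of sizes first, first+step, ..., last, each containing
    the previous one (prefixes of one random permutation)."""
    order = list(range(n_total))
    random.Random(seed).shuffle(order)
    return [order[:size] for size in range(first, last + 1, step)]


if __name__ == "__main__":
    subsets = nested_negative_subsets()
    # 30 training sets: 146 DNA-BPs plus 250, 300, ..., 1700 NBPs.
    print(len(subsets), [len(s) for s in subsets][:3], len(subsets[-1]))
```

Because every set contains its predecessor, any change in the measured indices along the series is attributable to the added negatives rather than to a re-drawn sample.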

In order to show the advantage of their method, Xu et al. [4] created a dataset called expanded benchmark dataset$_{1100}$ with all the 146 positive samples and 1100 negative samples in DNAeSet, which is employed as another training dataset to evaluate the predictive performance on the independent dataset DNAiSet. For convenience of comparison, we also select the expanded benchmark dataset to establish the classifier and test it on DNAiSet. Repeating this procedure five times, the average results are given in Table 7 (the first row). Results obtained by the other four methods (DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot) trained on the expanded benchmark dataset with n=1100 are also listed in Table 7. From this table we see that the overall accuracy of our method is about 92%, with MCC of 0.84 and F1M of 91.24%, which outperforms the other methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. This suggests that our method performs well on unbalanced datasets.

Conclusion

Based on two important physicochemical properties, the 20 standard amino acids were distributed into three groups, to each of which a representative symbol was assigned. By replacing each amino acid with its representative letter, a protein primary sequence was converted into a three-letter sequence, which can be viewed as a coarse-grained description of the protein primary sequence. On the basis of the three-letter sequence, a graph without loops and multiple edges was obtained. By taking advantage of the 2-D graph, we constructed a geometric line adjacency matrix (GLAM), and then the corresponding ALE-index, the line-adjacency index, the first Zagreb index and its modification were calculated. In addition, order-correlated factors were extracted via the reduced sequence. By combining these elements with the frequencies of occurrence of the 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. On five popular datasets, the proposed method was tested by phylogenetic analysis and identification of DNA-binding proteins. The results illustrated the better performance of our method.

Consent for Publication

Not applicable.
Table 1

The accession number, name and abbreviation for 72 coronavirus spike proteins.

No. | Accession number | Virus name/strain | Abbreviation
1 | CAB91145 | Transmissible gastroenteritis virus, genomic RNA | TGEVG
2 | NP_058424 | Transmissible gastroenteritis virus | TGEV
3 | AAK38656 | Porcine epidemic diarrhea virus strain CV777 | PEDVC
4 | NP_598310 | Porcine epidemic diarrhea virus | PEDV
5 | BAL45637 | Human coronavirus 229E | HCoV-229E
6 | AAP92675 | Avian infectious bronchitis virus isolate BJ | IBVBJ
7 | AAS00080 | Avian infectious bronchitis virus strain Ca199 | IBVC
8 | NP_040831 | Avian infectious bronchitis virus | IBV
9 | NP_937950 | Human coronavirus OC43 | HCoV-OC43
10 | AAK83356 | Bovine coronavirus isolate BCoV-ENT | BCoVE
11 | AAL57308 | Bovine coronavirus isolate BCoV-LUN | BCoVL
12 | AAA66399 | Bovine coronavirus strain Mebus | BCoVM
13 | AAL40400 | Bovine coronavirus strain Quebec | BCoVQ
14 | NP_150077 | Bovine coronavirus | BCoV
15 | AAB86819 | Mouse hepatitis virus strain MHV-A59C12 mutant | MHVA
16 | YP_209233 | Murine hepatitis virus strain JHM | MHVJHM
17 | AAF69334 | Mouse hepatitis virus strain Penn 97-1 | MHVP
18 | AAF69344 | Mouse hepatitis virus strain ML-10 | MHVM
19 | NP_045300 | Mouse hepatitis virus | MHV
20 | AAU04646 | SARS coronavirus civet007 | civet007
21 | AAU04649 | SARS coronavirus civet010 | civet010
22 | AAU04664 | SARS coronavirus civet020 | civet020
23 | AAV91631 | SARS coronavirus A022 | A022
24 | AAV49730 | SARS coronavirus B039 | B039
25 | AAP51227 | SARS coronavirus GD01 | GD01
26 | AAS00003 | SARS coronavirus GZ02 | GZ02
27 | AAP30030 | SARS coronavirus BJ01 | BJ01
28 | AAP13567 | SARS coronavirus CUHK-W1 | CUHK-W1
29 | AAP37017 | SARS coronavirus TW1 | TW1
30 | AAR87523 | SARS coronavirus TW2 | TW2
31 | BAC81348 | SARS coronavirus TWH genomic RNA | TWH
32 | BAC81362 | SARS coronavirus TWJ genomic RNA | TWJ
33 | AAQ01597 | SARS coronavirus Taiwan TC1 | TaiwanTC1
34 | AAQ01609 | SARS coronavirus Taiwan TC2 | TaiwanTC2
35 | AAP97882 | SARS coronavirus Taiwan TC3 | TaiwanTC3
36 | AAP13441 | SARS coronavirus Urbani | Urbani
37 | AAP72986 | SARS coronavirus HSR 1 | HSR1
38 | AAQ94060 | SARS coronavirus AS | AS
39 | AAP94737 | SARS coronavirus CUHK-AG01 | CUHK-AG01
40 | AAP94748 | SARS coronavirus CUHK-AG02 | CUHK-AG02
41 | AAP94759 | SARS coronavirus CUHK-AG03 | CUHK-AG03
42 | AAP30713 | SARS coronavirus CUHK-Su10 | CUHK-Su10
43 | AAP33697 | SARS coronavirus Frankfurt 1 | Frankfurt1
44 | AAR14803 | SARS coronavirus PUMC01 | PUMC01
45 | AAR14807 | SARS coronavirus PUMC02 | PUMC02
46 | AAR14811 | SARS coronavirus PUMC03 | PUMC03
47 | AAP41037 | SARS coronavirus TOR2 | TOR2
48 | AAP50485 | SARS coronavirus FRA | FRA
49 | AAR23250 | SARS coronavirus Sin01-11 | Sino1-11
50 | AHX00731 | MERS coronavirus | KFU-HKU1
51 | AHX00711 | MERS coronavirus | KFU-HKU13
52 | AHX00721 | MERS coronavirus | KFU-HKU19Dam
53 | AIY60578 | MERS coronavirus | Abu-Dhabi_UAE_9
54 | AIY60568 | MERS coronavirus | Abu-Dhabi_UAE_33
55 | AIZ74417 | MERS coronavirus | Hu-France(UAE)-FRA1
56 | AIZ74433 | MERS coronavirus | Hu-France-FRA2
57 | ALJ54502 | MERS coronavirus | Hu/Qunfidhah-KSA-Rs1338
58 | AKN24821 | MERS coronavirus | KFMC-1
59 | AKN24830 | MERS coronavirus | KFMC-7
60 | ALJ76282 | MERS coronavirus | Hu/Taif,KSA-2083
61 | ALJ76281 | MERS coronavirus | Hu/Taif,KSA-5920
62 | ALJ54493 | MERS coronavirus | Hu/Makkah-KSA-728
63 | ALB08267 | MERS coronavirus | KOREA/Seoul/014-1
64 | ALB08278 | MERS coronavirus | KOREA/Seoul/014-2
65 | ALR69641 | MERS coronavirus | D2731.3
66 | AKQ21055 | MERS coronavirus | ADFCA-HKU1
67 | AKQ21064 | MERS coronavirus | ADFCA-HKU2
68 | AKQ21073 | MERS coronavirus | ADFCA-HKU3
69 | ALA50001 | MERS coronavirus | camel/Taif/T68
70 | ALA50012 | MERS coronavirus | camel/Taif/T89
71 | ALT66813 | MERS coronavirus | Jordan_1
72 | ALT66802 | MERS coronavirus | Jordan_10
Table 2

The original numerical values for properties of the 20 standard amino acids.

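| Amino acid (AA) | pI^a (P1^0) | RD^a (P2^0) |
|---|---|---|
| A | 6.02 | 1889 |
| C | 5.02 | 3355 |
| D | 2.97 | 2209 |
| E | 3.22 | 1812 |
| F | 5.48 | 1916 |
| G | 5.97 | 2078 |
| H | 7.59 | 1507 |
| I | 6.02 | 1765 |
| K | 9.74 | 1797 |
| L | 5.98 | 1822 |
| M | 5.75 | 1689 |
| N | 5.42 | 1943 |
| P | 6.30 | 1720 |
| Q | 5.65 | 1538 |
| R | 10.76 | 1697 |
| S | 5.68 | 2000 |
| T | 6.53 | 1469 |
| V | 5.97 | 1680 |
| W | 5.89 | 2317 |
| Y | 5.66 | 1787 |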

a: taken from [26-28]

Table 3

The scaled values for properties of the 20 standard amino acids.

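| AA | P1* | Label 1 | P2* | Label 2 |
|---|---|---|---|---|
| A | 0.3915 | - | 0.2227 | - |
| C | 0.2632 | - | 1.0000 | + |
| D | 0 | - | 0.3924 | + |
| E | 0.0321 | - | 0.1819 | - |
| F | 0.3222 | - | 0.2370 | + |
| G | 0.3851 | - | 0.3229 | + |
| H | 0.5931 | + | 0.0201 | - |
| I | 0.3915 | - | 0.1569 | - |
| K | 0.8691 | + | 0.1739 | - |
| L | 0.3864 | - | 0.1872 | - |
| M | 0.3569 | - | 0.1166 | - |
| N | 0.3145 | - | 0.2513 | + |
| P | 0.4275 | + | 0.1331 | - |
| Q | 0.3440 | - | 0.0366 | - |
| R | 1.0000 | + | 0.1209 | - |
| S | 0.3479 | - | 0.2815 | + |
| T | 0.4570 | + | 0 | - |
| V | 0.3851 | - | 0.1119 | - |
| W | 0.3748 | - | 0.4496 | + |
| Y | 0.3453 | - | 0.1686 | - |
| Mean (P̄n) | 0.3994 | | 0.2283 | |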
Table 4

The values for properties of the three groups.

| Group | Representative | P1' | P2' | P1 | P2 |
|---|---|---|---|---|---|
| GI | A | 0.3291 | 0.1478 | 0.9122 | 0.4097 |
| GII | C | 0.2868 | 0.4193 | 0.5646 | 0.8253 |
| GIII | H | 0.6693 | 0.0896 | 0.9912 | 0.1327 |
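The relation between Tables 2, 3 and 4 can be checked with a short sketch. It rests on assumptions, since the rules are not stated in this section: that the scaled values are min-max normalizations of the raw values, that a label is "+" when the scaled value exceeds the column mean, and that the three groups correspond to the label pairs (-, -), (-, +) and (+, -). All helper names are hypothetical. Under these assumptions the group averages come out equal to the P1' and P2' columns of Table 4. The P1 and P2 columns are not reproduced, because their derivation is not described here.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"
# Raw values transcribed from Table 2, in the order of AA.
PI = [6.02, 5.02, 2.97, 3.22, 5.48, 5.97, 7.59, 6.02, 9.74, 5.98,
      5.75, 5.42, 6.30, 5.65, 10.76, 5.68, 6.53, 5.97, 5.89, 5.66]
RD = [1889, 3355, 2209, 1812, 1916, 2078, 1507, 1765, 1797, 1822,
      1689, 1943, 1720, 1538, 1697, 2000, 1469, 1680, 2317, 1787]


def min_max(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def labels(scaled):
    """Label each value '+' if it lies above the mean of its column, else '-'."""
    mean = sum(scaled) / len(scaled)
    return ["+" if v > mean else "-" for v in scaled]


P1_STAR, P2_STAR = min_max(PI), min_max(RD)

# ASSUMPTION: groups are defined by the label pair; (+, +) does not occur.
PAIR_TO_GROUP = {("-", "-"): "GI", ("-", "+"): "GII", ("+", "-"): "GIII"}
GROUP = {aa: PAIR_TO_GROUP[pair]
         for aa, pair in zip(AA, zip(labels(P1_STAR), labels(P2_STAR)))}


def group_mean(scaled, name):
    """Average a scaled property over the amino acids of one group."""
    members = [v for aa, v in zip(AA, scaled) if GROUP[aa] == name]
    return sum(members) / len(members)


for name in ("GI", "GII", "GIII"):
    print(name, round(group_mean(P1_STAR, name), 4), round(group_mean(P2_STAR, name), 4))
```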
Table 5

The results of 5CV for 3 runs.

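| Test | 1 | 2 | 3 | Average |
|---|---|---|---|---|
| Se (%) | 78.77 | 78.77 | 79.45 | 79.00 |
| Sp (%) | 96.00 | 96.00 | 95.60 | 95.87 |
| Acc (%) | 89.65 | 89.65 | 89.65 | 89.65 |
| MCC | 0.7761 | 0.7761 | 0.7758 | 0.776 |
| F1M (%) | 84.87 | 84.87 | 84.98 | 84.91 |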
Table 6

Performance of different methods (trained on DNASet and tested on DNAiSet).

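| Method | ACC (%) | MCC | F1M (%) | Se (%) | Sp (%) |
|---|---|---|---|---|---|
| This work | 87.91 | 0.756 | 86.07 | 82.93 | 92.00 |
| DNAbinder (PSSM-21) | 79.00 | 0.61 | 70.31 | 54.87 | 98.08 |
| DNAbinder (PSSM-400) | 80.11 | 0.62 | 72.73 | 58.53 | 97.97 |
| DNA-Prot | 84.61 | 0.69 | 81.08 | 73.17 | 94.00 |
| iDNA-Prot | 77.47 | 0.55 | 75.73 | 78.05 | 77.00 |
| enDNA-Prot | 84.62 | 0.70 | 84.62 | 73.18 | 94.00 |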
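The five measures reported in Tables 5 to 7 can be computed from a binary confusion matrix as sketched below, assuming the standard definitions and that F1M denotes the usual F1 measure. The function name and the counts in the usage example are hypothetical: the counts are not taken from the paper and were chosen only because they are consistent with the Se and Sp of the "This work" row above.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute Se, Sp, ACC, MCC and F1 from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"Se": se, "Sp": sp, "ACC": acc, "MCC": mcc, "F1": f1}


# Hypothetical counts: 41 positive and 50 negative test proteins.
metrics = binary_metrics(tp=34, fn=7, tn=46, fp=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts, the resulting ACC, MCC and F1 agree with the "This work" row to within rounding.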
Table 7

Performance of different methods (trained on DNAeSet and tested on DNAiSet).

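| Method | ACC (%) | MCC | F1M (%) |
|---|---|---|---|
| This work | 92.05 | 0.84 | 91.24 |
| DNAbinder (PSSM-21) | 72.93 | 0.52 | 57.39 |
| DNAbinder (PSSM-400) | 78.45 | 0.61 | 68.80 |
| DNA-Prot | 76.37 | 0.58 | 64.46 |
| iDNA-Prot | 76.92 | 0.58 | 66.13 |
| enDNA-Prot | 89.56 | 0.79 | 87.42 |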
  38 in total

1.  A computational approach to simplifying the protein folding alphabet.

Authors:  J Wang; W Wang
Journal:  Nat Struct Biol       Date:  1999-11

2.  Modeling study on the validity of a possibly simplified representation of proteins.

Authors:  J Wang; W Wang
Journal:  Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics       Date:  2000-06

3.  Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve.

Authors:  C T Zhang; J Wang
Journal:  Nucleic Acids Res       Date:  2000-07-15       Impact factor: 16.971

4.  iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition.

Authors:  Hao Lin; En-Ze Deng; Hui Ding; Wei Chen; Kuo-Chen Chou
Journal:  Nucleic Acids Res       Date:  2014-10-31       Impact factor: 16.971

Review 5.  Graphical representation of proteins.

Authors:  Milan Randić; Jure Zupan; Alexandru T Balaban; Drazen Vikić-Topić; Dejan Plavsić
Journal:  Chem Rev       Date:  2010-10-12       Impact factor: 60.622

6.  Using deformation energy to analyze nucleosome positioning in genomes.

Authors:  Wei Chen; Pengmian Feng; Hui Ding; Hao Lin; Kuo-Chen Chou
Journal:  Genomics       Date:  2015-12-24       Impact factor: 5.736

7.  Light-directed synthesis of peptide nucleic acids (PNAs) chips.

Authors:  Zheng-Chun Liu; Dong-Sik Shin; Mohammadreza Shokouhimehr; Kook-Nyung Lee; Byung-Wook Yoo; Yong-Kweon Kim; Yoon-Sik Lee
Journal:  Biosens Bioelectron       Date:  2007-01-19       Impact factor: 10.618

8.  Representation of proteins as walks in 20-D space.

Authors:  M Novic; M Randic
Journal:  SAR QSAR Environ Res       Date:  2008 Apr-Jun       Impact factor: 3.000

9.  A Novel Protein Characterization Based on Pseudo Amino Acids Composition and Star-Like Graph Topological Indices.

Authors:  Ping-An He; Hong Tao; Tingting Ma; Qi Dai; Yuhua Yao
Journal:  Comb Chem High Throughput Screen       Date:  2017       Impact factor: 1.339

10.  iRNA-PseU: Identifying RNA pseudouridine sites.

Authors:  Wei Chen; Hua Tang; Jing Ye; Hao Lin; Kuo-Chen Chou
Journal:  Mol Ther Nucleic Acids       Date:  2016
  1 in total

1.  FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis.

Authors:  Wei Li; Lina Yang; Yu Qiu; Yujian Yuan; Xichun Li; Zuqiang Meng
Journal:  BMC Bioinformatics       Date:  2022-08-19       Impact factor: 3.307
