Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation.

Chun Li¹,²,³, Jialing Zhao², Changzhong Wang², Yuhua Yao¹.

Abstract

AIM AND OBJECTIVE: The rapid increase in the amount of available protein sequence data creates an urgent need for novel computational algorithms to analyze and compare these sequences. This study was undertaken to develop an efficient computational approach for encoding protein sequences in a timely manner and extracting the information hidden in them.
METHODS: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically.
RESULTS: Using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods, with improvements in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M.
CONCLUSION: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.


Keywords: Adjacency matrix; Generalized PseAAC; graph; identification of DNA-binding proteins; phylogenetic analysis; protein sequences.


Year: 2018 PMID: 29380690 PMCID: PMC5930480 DOI: 10.2174/1386207321666180130100838

Source DB: PubMed Journal: Comb Chem High Throughput Screen ISSN: 1386-2073 Impact factor: 1.339

INTRODUCTION

DNA-binding proteins (DNA-BPs) are very important functional proteins in a cell. These proteins play vital roles in various cellular processes, including DNA replication, transcription, regulation of gene expression, packaging, and other activities associated with DNA [1-5]. It is therefore important to distinguish DNA-BPs from non-DNA-binding proteins (NBPs). Many experimental and computational techniques have been developed for identifying DNA-BPs. Experimental techniques can provide a clear-cut answer for a query protein, but they are costly and time-consuming, and thus impractical for large datasets [3-7]. Computational methods can be broadly divided into two categories: structure-based methods and sequence-based methods. The former can discriminate DNA-binding from non-binding proteins with high accuracy, but they cannot be employed in high-throughput annotation, as they require the structure information of a query protein [1]. Although tremendous progress has been achieved in the experimental determination of protein structures in the past five decades, it cannot keep pace with the explosive growth of sequence information resulting from modern sequencing technology [8]. Yet, as suggested by Anfinsen [9], proteins contain within their amino acid sequences enough information to determine their native conformation. Therefore, it is more promising to use sequence-based methods to identify DNA-BPs. One of the core issues for sequence-based methods is how to characterize protein sequences and extract the information hidden in them. The most typical approach is to use the amino acid composition (AAC) to formulate a protein sequence. Owing to its simplicity, the AAC model was widely applied in a number of earlier statistics-based methods.
However, as pointed out in Ref. [6], if we denote by $n_1, n_2, \ldots, n_{20}$ the counts of the 20 standard amino acids in a protein sequence, then there are a total of $\frac{(n_1+n_2+\cdots+n_{20})!}{n_1!\,n_2!\cdots n_{20}!}$ different sequences/strings possessing the same AAC. The reason is that the AAC model neglects the order relation among the elements of a sequence. To overcome this drawback, the concept of pseudo amino acid composition (PseAAC, or Chou’s PseAAC) was proposed [10-18]. The essence of PseAAC is that it not only covers AAC, but also contains additional order-correlated factors along a protein sequence.

Another popular approach to sequence analysis is to convert the protein primary sequence over 20 amino acids into a reduced one. The earliest and simplest reduction was the well-known HP model, in which the 20 standard amino acids are divided into two types, hydrophobic (H) (or non-polar) and polar (P) (or hydrophilic). On the basis of this classic model, a detailed HP model was introduced by dividing the polar class into three subclasses: positive polar, uncharged polar and negative polar [19]. In addition, a few five-group classifications of amino acids were presented for practical purposes [20-23]. By considering property-based triples, Li et al. [6] put forward a six-letter model of amino acids. Also based on three physicochemical properties of amino acids, Yao et al. [24] mapped the 20 standard amino acids to the eight vertices of a cube centered at the origin, thus obtaining an eight-group model of amino acids.

Motivated by the work mentioned above, we propose a generalized PseAAC which is grounded on a three-letter model and a 2-D graphical representation of a protein sequence. We summarize the main work of this paper as follows. In section 2, we briefly introduce the five datasets used in this study. In section 3, on the basis of two important physicochemical properties of amino acids, we cluster the 20 standard amino acids into three groups.
By assigning a representative symbol to each group, we transform a protein sequence into a three-letter sequence. Then a 2-D graph without loops and multiple edges and its geometric line adjacency matrix are obtained. A sequence-derived feature vector of dimension (25+λ), with λ the number of correlation ranks counted, is thus constructed to characterize a protein sequence. Our scheme is similar to, but clearly distinct from, that of PseAAC. In section 4, we apply the presented feature vector to compare β-globin proteins of 17 species and 72 spike proteins of coronaviruses, respectively. We also develop an SVM (support vector machine) model using the generalized PseAAC to identify DNA-binding and non-binding proteins on three datasets. Experimental results show that the presented method outperforms existing methods including DNAbinder [1], DNA-Prot [2], iDNA-Prot [3] and enDNA-Prot [4]. Finally, conclusions are given in section 5.
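
The remark in the introduction that many different sequences share one AAC can be made concrete with a short calculation. The following is a minimal sketch of the multinomial count; the function name `sequences_with_same_aac` is a hypothetical helper introduced here for illustration only.

```python
from collections import Counter
from math import factorial


def sequences_with_same_aac(sequence: str) -> int:
    """Number of distinct strings with the same residue counts as `sequence`.

    This is the multinomial coefficient N! / (n_1! * n_2! * ... ), where N is
    the sequence length and n_i are the counts of each residue type.
    """
    total = factorial(len(sequence))
    for count in Counter(sequence).values():
        total //= factorial(count)  # exact at every step
    return total


# Two alanines and one cysteine can be arranged as AAC, ACA or CAA.
print(sequences_with_same_aac("AAC"))  # 3
```

Even for a short segment the count grows very quickly, which is why composition alone cannot distinguish sequences that differ only in residue order.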

Datasets

In this study, the following five datasets are used. For convenience, they are denoted by BetaSet, CoVSet, DNASet, DNAeSet and DNAiSet, respectively.

BetaSet

The dataset called BetaSet is composed of the β-globin proteins of 17 species: Human (ALU64020), Gorilla (P02024), Chimpanzee (P68873), Cattle (CAA25111), Banteng (BAJ05126), Goat (AAA30913), Sheep (ABC86525), European hare (CAA68429), Rabbit (CAA24251), House mouse (ADD52660), Western wild mouse (ACY03394), Spiny mouse (ACY03377), Norway rat (CAA29887), Opossum (AAA30976), Guttata (ACH46399), Gallus (CAA23700), Muscovy duck (CAA33756). This dataset is used to determine the adjustable parameters in a feature vector.

CoVSet

This dataset consists of 72 spike proteins of coronaviruses (CoVs), 23 of which are from MERS-CoVs and 30 from SARS-CoVs. CoVs can be divided into three groups according to serotype. Group alpha (formerly known as CoV-1) and group beta (formerly CoV-2) contain mammalian viruses, while group gamma (formerly CoV-3) contains only avian viruses. The name, accession number, and abbreviation of the 72 sequences are listed in Table 1. According to the existing taxonomic groups, sequences 1-5 belong to the first group, sequences 6-8 belong to the third group, and the remaining sequences belong to the second group.

DNASet

This is a benchmark dataset created in 2007 by Kumar et al. [1]. It contains 396 sequences, 146 of which are DNA-BPs (positive samples), and 250 NBPs (negative samples). In both the positive and the negative sets, the sequence similarity between any two proteins is not more than 25%.

DNAiSet

This dataset was also generated by Kumar et al. [1], based on the work of Wang and Brown [25]. It originally contained 92 DNA-BPs and 100 NBPs. In order to avoid overestimating a given method, those sequences having sequence similarity with DNASet were removed by Xu et al. [4], and the final dataset is composed of 82 DNA-BPs and 100 NBPs.

DNAeSet

As an expanded benchmark dataset, DNAeSet was constructed in 2014 by Xu et al. [4]. Using the same sequence filter criterion as for DNASet, they added a number of NBPs to DNASet, bringing the total number of NBPs to 2125. After removing the sequences that share sequence identity with DNAiSet, the current version of DNAeSet has 146 DNA-BPs and 1710 NBPs.

Methods

Three-letter Sequence of Protein Sequence and its 2-D Graphical Representation

Isoelectric point (pI) and relative distance (RD) are two important physicochemical properties of the 20 standard amino acids [26-28]. Their original numerical values are listed in Table 2. As can be seen from this table, the values of $P_1^0$ (isoelectric point) are in the range [2.97, 10.76], while $P_2^0$ (relative distance) varies between 1469 and 3355. Therefore, normalization of these values is needed. Here, we scale them into the interval [0,1] by the formula below:

$P_1(a_i) = \frac{P_1^0(a_i) - \min_k P_1^0(a_k)}{\max_k P_1^0(a_k) - \min_k P_1^0(a_k)}$, $P_2(a_i) = \frac{P_2^0(a_i) - \min_k P_2^0(a_k)}{\max_k P_2^0(a_k) - \min_k P_2^0(a_k)}$, (1)

The corresponding values are listed in Table . The last row in this table gives the average values. For the i-th amino acid $a_i$, if $P_1(a_i)$ is above the average value $\bar{P}_1$, then we label it by “+”; otherwise we label it by “-”. Similarly, if property $P_2$ is considered, the second label for amino acid $a_i$ can be obtained. In this way, each of the 20 standard amino acids has a label pair. In Table , the corresponding labels are also listed. Amino acids with the same label pair are viewed as members of the same group. Thus, the 20 standard amino acids are distributed to the following groups: GI={ A,Y,V,Q,M,L,I,E }, GII={ C,W,S,N,G,F,D }, GIII={ H,T,R,P,K }. For each group, the first amino acid is used to stand for the group, so the three groups have the representative letters A, C and H, respectively. The value of a property for a group is defined as the average value of that property over all members of the group. In the left-hand side of Table , we list the corresponding values of the three groups. Obviously, each group can be viewed as a 2-D vector. We further normalize these vectors to unit length, and list the normalized values in the right-hand side of Table . In Fig. (1), we show the 2-D map of the 20 standard amino acids according to the classification above.
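
The grouping step can be checked directly against the Table 2 values. The sketch below assumes min-max scaling into [0,1] and a strict "above the mean" test for the "+" label; both are assumptions made for illustration (no value in Table 2 coincides with its column mean, so the strictness does not affect the outcome). The helper names `min_max` and `sign_labels` are hypothetical.

```python
# Table 2 values, in the order of the one-letter codes below.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PI = [6.02, 5.02, 2.97, 3.22, 5.48, 5.97, 7.59, 6.02, 9.74, 5.98,
      5.75, 5.42, 6.30, 5.65, 10.76, 5.68, 6.53, 5.97, 5.89, 5.66]
RD = [1889, 3355, 2209, 1812, 1916, 2078, 1507, 1765, 1797, 1822,
      1689, 1943, 1720, 1538, 1697, 2000, 1469, 1680, 2317, 1787]


def min_max(values):
    """Scale values linearly into [0, 1] (assumed form of the normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def sign_labels(values):
    """Label each value '+' if it lies above the mean, otherwise '-'."""
    mean = sum(values) / len(values)
    return ["+" if v > mean else "-" for v in values]


p1, p2 = min_max(PI), min_max(RD)
labels = zip(sign_labels(p1), sign_labels(p2))

# Collect amino acids that share the same (pI label, RD label) pair.
groups = {}
for aa, pair in zip(AMINO_ACIDS, labels):
    groups.setdefault(pair, set()).add(aa)

for pair, members in sorted(groups.items()):
    print(pair, "".join(sorted(members)))
```

Under these assumptions the three non-empty label pairs reproduce the groups GI, GII and GIII listed above, and no amino acid receives the pair (+, +).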

Fig. (1)

The 2-D map of the 20 standard amino acids.

By substituting each amino acid with its representative letter, a protein primary sequence is reduced to a three-letter sequence. For example, the three-letter sequence of the sequence segment EKAAVTGFWGKVKVDEVGAEA is AHAAAHCCCCHAHACAACAAA. To obtain the graphical representation of a reduced sequence, we start from the origin (0,0) and move in the xoy-plane in the direction dictated by Fig. (1). Mathematically, let $S = s_1 s_2 \cdots s_N$ be a given three-letter sequence. Then one has a map $\varphi$, which maps S into a plot set. Explicitly, $\varphi(S) = \{\varphi(s_1), \varphi(s_2), \ldots, \varphi(s_N)\}$, and $\varphi(s_i)$ is given by

$\varphi(s_i) = (x_i, y_i)^T = \left(\sum_{k=1}^{i} u_1(s_k),\ \sum_{k=1}^{i} u_2(s_k)\right)^T$,

where T represents the transpose of a matrix, and $u_j(s_k)$ (j=1,2) represents the j-th component of the unit vector corresponding to $s_k$ (cf. Fig. (1) and Table ). Connecting all points of the plot set in turn, a 2-D curve is drawn. In Fig. (2), we show the 2-D graphical representation of sequence AHAAAHCCCCHAHACAACAAA. It is not difficult to see that the 2-D graphical representation has no degeneracy, and thus is a simple graph, that is, a graph without loops and multiple edges.
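
The reduction and the walk in the plane can be sketched as follows. The group memberships come from the text; the three direction vectors in `DIRECTIONS` are hypothetical placeholders chosen only for illustration, since the paper's unit vectors are derived from its tables and are not reproduced here. The origin is included as the first point of the returned curve.

```python
import math

# Group memberships as given in the text; keys are the representative letters.
GROUPS = {"A": "AYVQMLIE", "C": "CWSNGFD", "H": "HTRPK"}
LETTER_OF = {aa: rep for rep, members in GROUPS.items() for aa in members}


def reduce_sequence(protein: str) -> str:
    """Replace every amino acid with the representative letter of its group."""
    return "".join(LETTER_OF[aa] for aa in protein)


def unit(x: float, y: float):
    """Scale a 2-D vector to unit length."""
    norm = math.hypot(x, y)
    return (x / norm, y / norm)


# Hypothetical directions (NOT the paper's values), one unit vector per letter.
DIRECTIONS = {"A": unit(0.4, 0.2), "C": unit(0.3, 0.6), "H": unit(0.7, 0.1)}


def walk(reduced: str, directions=DIRECTIONS):
    """Start at the origin and take one unit step per letter."""
    points = [(0.0, 0.0)]
    for letter in reduced:
        dx, dy = directions[letter]
        x, y = points[-1]
        points.append((x + dx, y + dy))
    return points


reduced = reduce_sequence("EKAAVTGFWGKVKVDEVGAEA")
print(reduced)
curve = walk(reduced)
```

The reduction reproduces the three-letter string quoted in the text; the shape of the curve depends entirely on the direction vectors supplied.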

Fig. (2)

The 2-D graphical representation.

(25 + λ) Dimensional Feature Vector

In this section, we give a numerical characterization of a protein sequence that facilitates quantitative comparison of protein sequences. Once a graphical representation is given, it can be transformed into structural matrices, such as the matrices ED, GD, M/M, and L/L [6, 24, 29-37]. Here we employ the L/L matrix. L/L is a nonnegative symmetric matrix whose off-diagonal entries are defined as the quotient of the Euclidean distance between two vertices of the graph and the sum of the geometrical lengths of the edges between the two vertices. By definition, all diagonal elements are zero. Obviously, the entries of an L/L matrix are less than or equal to one. The higher-order kL/kL matrix is the matrix whose (i,j)-entry is $([L/L]_{ij})^k$. As the exponent k approaches positive infinity, kL/kL converges to a (0,1) matrix (denoted by bL/bL). With respect to the proposed 2-D graph, [bL/bL]ij=1 if and only if the two corresponding vertices lie on a straight line in the curve, including the cases of adjacency and non-adjacency. In this sense, we call such a matrix a geometric line adjacency matrix (GLAM), or simply a generalized adjacency matrix (GAM), generated by a graph.

The first Zagreb index is a well-known vertex-degree-based molecular structure descriptor. This index was first considered by Gutman and Trinajstić about 45 years ago, and has since been discussed and used in numerous studies (see [38-40] and the references cited therein). The first Zagreb index is defined as

$Zg_1(G) = \sum_{u \in V(G)} d_u^2$, (2)

where $d_u$ denotes the degree (= number of first neighbors) of the vertex u in graph G. If G is a simple graph (i.e. without loops and multiple edges), Zg1 can also be obtained directly from its adjacency matrix, since the row sums of this matrix are equal to the degrees of the corresponding vertices. It should be mentioned that the Zagreb index gives greater weights to inner vertices and edges than to outer vertices and edges of a graph [38].
One way to amend this is to insert inverse values of the vertex degree into Eq. (2), which gives the modified Zagreb index [38]:

$mZg_1(G) = \sum_{u \in V(G)} \frac{1}{d_u^2}$.

Clearly, mZg1 gives greater weights to outer vertices/edges than to inner ones in a graph. At the same time, on the basis of our geometric line adjacency matrix, we can count the vertex pairs with a generalized adjacency relationship. It should be noted that, in our case, the 'neighbors' include not only the conventional neighbors, i.e. the first neighbors, but also the second neighbors, the third neighbors, and so on. We call the corresponding number of graph G a line-adjacency index, and denote it by La(G). Then we have a graph-based index:

For a symmetric matrix, eigenvalue-based indices, such as the leading eigenvalue [29-33, 35] and the graph energy [17], are often used as matrix invariants. Moreover, in our previous paper [41], an alternative invariant called the ‘ALE-index’ was proposed. The ALE-index is defined by the following formula: (4) where L is the order of the matrix, and $\|\cdot\|_{m_1}$ and $\|\cdot\|_F$ are the m1- and F-norms of a matrix, respectively. In order to reduce variations caused by comparing matrices of different sizes, we consider a normalized ALE-index instead of the ALE-index itself, and use it as the matrix-based index.

In addition, with respect to the three-letter sequence S, we define a coupling mode function by , (n=1, 2) (5) where P1 and P2 are the values of the properties of the corresponding representative letter (group), and the integer k represents the counted rank (or tier) of the coupling mode. Then, following procedures similar to those in [10, 11], we can extract global sequence-order information of the three-letter sequence S by , , . (6) The quantity defined in Eq. (6) is called the k-th tier correlation factor. Clearly, the first-tier factor reflects the coupling mode between the most contiguous elements along the three-letter sequence S, the second-tier factor the coupling mode between the second most contiguous elements, the third-tier factor that between the third most contiguous, and so forth.
Furthermore, from the respective counts of the three representative letters (A, C and H) in sequence S, we can obtain a so-called group composition (GC): where $|\cdot|$ denotes the size of a group (set). Consequently, (5+λ) elements are derived, which reflect the information about the reduced sequence and, in particular, the 2-D graphical representation. By combining these elements with the conventional amino acid composition (AAC), a (25+λ) dimensional feature vector can be constructed to numerically characterize a protein sequence: , (7) where (8) Here, the AAC components are the frequencies of occurrence of the 20 standard amino acids in a protein sequence, and w1, w2 and w3 are weight factors. As will be described later in detail, the four adjustable parameters in Eqs (7) and (8) can be determined by a set of known samples. Roughly speaking, the vector contains the features of AAC as well as information beyond AAC, which is similar in form to Chou’s PseAAC. Therefore, we call the vector formulated by Eqs (7) and (8) the generalized PseAAC of a protein sequence.
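
The matrix and degree-based quantities described in this section can be illustrated on a small polyline. This is a sketch under stated assumptions: the four `points` form a toy curve, not a protein graph; the GLAM is obtained by thresholding L/L entries at 1 within a tolerance `tol`; and the Zagreb-type indices are computed from the row sums of whichever 0/1 matrix is passed in. Which matrix the paper actually feeds into each index is not specified here, and the pair count in `line_adjacency_count` is only one plausible reading of the line-adjacency index.

```python
import math


def ll_matrix(points):
    """L/L matrix: Euclidean distance divided by path length along the curve."""
    n = len(points)
    cum = [0.0]  # cumulative edge length from the first vertex
    for p, q in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = math.dist(points[i], points[j]) / (cum[j] - cum[i])
    return m


def glam(points, tol=1e-9):
    """0/1 limit of the k-th power matrix: 1 where the L/L entry equals 1."""
    return [[1 if abs(v - 1.0) <= tol else 0 for v in row]
            for row in ll_matrix(points)]


def zagreb_indices(adj):
    """First Zagreb index and its modified form from row sums of a 0/1 matrix."""
    degrees = [sum(row) for row in adj]
    zg1 = sum(d * d for d in degrees)
    mzg1 = sum(1 / (d * d) for d in degrees if d > 0)
    return zg1, mzg1


def line_adjacency_count(adj):
    """Number of vertex pairs marked as (generalized) adjacent."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))


# Toy curve: three collinear vertices followed by a right-angle turn.
points = [(0, 0), (1, 0), (2, 0), (2, 1)]
g = glam(points)
zg1, mzg1 = zagreb_indices(g)
print(g, zg1, round(mzg1, 4), line_adjacency_count(g))
```

In the toy curve the first and third vertices are not neighbors in the usual sense, yet they are marked as adjacent because they lie on one straight segment, which is the behaviour the GLAM is meant to capture.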

Results and Discussion

In this section, we discuss the use of the generalized PseAAC. As can be seen from Eqs (7) and (8), the present mathematical descriptor contains four undetermined parameters: λ, w1, w2 and w3. Here λ represents the total number of correlation ranks counted (cf. Eq (6)), which is an integer. Generally speaking, the greater the value of λ, the more sequence-order effects are incorporated. However, if the value is too large, it might cause overfitting or the ‘high dimension disaster’ [15]; therefore, we endeavour to limit λ to a small integer. In this study, the five datasets (BetaSet, CoVSet, DNASet, DNAeSet and DNAiSet) are arranged into two groups: one contains BetaSet, and the other includes the rest. The first group is used for determining the four adjustable parameters, and the second group for testing purposes.

Parameter Determination

According to the method described above, we first associate each of the 17 protein sequences in BetaSet with a (25+λ) dimensional vector (cf. Eqs (7) and (8)), and then calculate the pairwise Euclidean distance between any two of the 17 protein sequences via their vectors. Thus a real symmetric distance matrix is obtained. On the basis of this distance matrix, a UPGMA tree is constructed using the MEGA4 package. The result depends on the values of the rank λ and the three weight factors. It is found that with λ = 6 and the chosen values of w1, w2 and w3, the three non-mammals (Muscovy duck, Gallus and Guttata) form a separate branch and stay outside of the mammals. Moreover, in the subtree of mammals, the primate species (Human, Chimpanzee, Gorilla) are grouped closely. Also, the rodent species (Norway rat, Spiny mouse, House mouse, Western wild mouse) and the lagomorph species (Rabbit, European hare) are situated on independent branches, while Goat, Sheep, Cattle and Banteng cluster together (Fig. (3)). This result is analogous to that reported in the literature [6, 29, 30, 35, 36]. Accordingly, these four values are used for the four undetermined parameters, and a 31-D feature vector is thus obtained.
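
The distance-matrix and tree-building steps can be sketched in plain Python. This is a stand-in for illustration, not the MEGA4 procedure used in the study: it implements average-linkage (UPGMA) clustering on a Euclidean distance matrix and returns only the tree topology as a nested string, without branch lengths. The vectors and the names `s1` to `s4` are hypothetical toy data.

```python
import math
from itertools import combinations


def distance_matrix(vectors):
    """Pairwise Euclidean distances between feature vectors."""
    return [[math.dist(u, v) for v in vectors] for u in vectors]


def upgma(dist, names):
    """Average-linkage clustering; returns the topology as a nested string."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    d = {(i, j): dist[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted mean keeps the distance an average over all leaf pairs.
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy 2-D vectors standing in for the feature vectors of four sequences.
vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
tree = upgma(distance_matrix(vectors), ["s1", "s2", "s3", "s4"])
print(tree)  # ((s1,s2),(s3,s4))
```

In practice one would repeat this for each candidate setting of the four parameters and keep the setting whose tree best matches the known taxonomy, which is the selection logic the paragraph above describes.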

Test I: Phylogenetic Analysis of Coronavirus Spike Proteins

In order to evaluate the effectiveness of our method, we test it by phylogenetic analysis on the CoVSet dataset. Coronaviruses (CoVs) belong to the genus Coronavirus of family Coronaviridae [42]. The first coronavirus (HCoV-229E) was isolated from humans in 1965. Until 2003, coronaviruses attracted little interest beyond causing mild upper respiratory tract infections. However, this situation changed dramatically with the emergence of SARS-CoV and MERS-CoV. As of July 2017, 2040 laboratory-confirmed cases of MERS-CoV infection had been reported in over 27 countries, and at least 710 individuals had died (crude CFR 34.8%) [43]. Using the above-determined values for parameters λ, w1, w2, and w3, we calculate the 31-D feature vectors of the 72 coronavirus spike proteins and their Euclidean distance matrix; then the corresponding phylogenetic tree (Fig. (4)) is constructed. Observing Fig. (4), we find that the 72 coronavirus spike proteins are clustered into three groups: one contains the five alpha coronaviruses (PEDVC, PEDV, TGEVG, TGEV, and HCoV-229E), the second includes the three gamma coronaviruses (IBV, IBVBJ, IBVC), and the third corresponds to the group beta. A closer look at the subtree of beta coronaviruses shows that the MERS-CoVs are clearly clustered together, as are the SARS-CoVs, while MHV, MHVA, MHVM, MHVP, MHVJHM, BCoV, BCoVE, BCoVL, BCoVM, BCoVQ and HCoV-OC43 are situated on an independent branch. The resulting clusters agree well with the established taxonomic groups.

Fig. (4)

The relationship tree of 72 coronavirus spike proteins.

Test II: Identification of DNA-binding Proteins

To further assess the effectiveness of the proposed method, we conduct a series of experiments on the identification of DNA-binding proteins using three datasets: DNASet, DNAeSet and DNAiSet. Among them, DNASet and DNAeSet serve as training datasets, while DNAiSet serves as an independent testing dataset. A support vector machine (SVM) is employed as the classifier, and the R package ‘e1071’ v1.6-8 [44] is used to implement it. For a given set of binary-labeled training examples, SVM maps the input space into a higher-dimensional space and seeks a hyperplane to separate the positive samples from the negative ones [25]. The optimal hyperplane maximizes the separation margin between the two classes of training data. The distance measurement between the data points in the high-dimensional space is defined by the kernel function. In this study, we use the radial basis function (RBF) kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$. This model involves two tunable parameters: the kernel width γ and the penalty parameter C. Prediction performance can be assessed using quality indices including Accuracy (ACC), Sensitivity (Se), Specificity (Sp), F-measure (F1M) and Matthews correlation coefficient (MCC) [2, 4, 5, 25, 37, 45]:

$ACC = \frac{TP+TN}{TP+TN+FP+FN}$, $Se = \frac{TP}{TP+FN}$, $Sp = \frac{TN}{TN+FP}$, (9)

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$, $F1M = \frac{2PR}{P+R}$, $P = \frac{TP}{TP+FP}$, $R = \frac{TP}{TP+FN}$,

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples obtained from the prediction, respectively, while P and R denote the Precision and Recall values, respectively. One can also use the alternative definitions adopted in a series of recently published studies [15, 46-48]. The higher the values of these measurements, the better the quality of prediction.
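
The kernel and the quality indices can be written out in a few lines. The sketch below uses the standard definitions of these measures; the confusion-matrix counts are the ones the text reports for the independent test later in this section (68 of 82 DNA-BPs and 92 of 100 NBPs predicted correctly). The function names `rbf_kernel` and `binary_metrics` are hypothetical helpers, not part of the e1071 package.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * math.dist(x, y) ** 2)


def binary_metrics(tp, tn, fp, fn):
    """Standard quality indices computed from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity = recall
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * se / (precision + se)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"ACC": acc, "Se": se, "Sp": sp, "MCC": mcc, "F1M": f1}


# 68 of 82 DNA-BPs and 92 of 100 NBPs correctly predicted.
m = binary_metrics(tp=68, tn=92, fp=8, fn=14)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy, sensitivity, specificity and MCC agree with the reported 87.91%, 82.93%, 92.00% and 0.756 after rounding, which supports reading Eq. (9) as the standard definitions.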

Predictive Performance on Benchmark Dataset

This experiment is carried out on DNASet itself. To obtain a reliable result, the SVM model on DNASet is established by 5-fold cross-validation (5CV) with 3 runs. Here the 31-D feature vector of a protein sequence serves as the input for SVM. In a 5CV, the positive and negative samples are randomly distributed into five subsets, the so-called folds, and the test is repeated five times. In each of the five iterations, one subset is used as the testing set, while the remaining four subsets are combined and used to build a classifier (training). The predictions made for the test data instances in all five iterations yield the final result. The sensitivity, specificity, ACC, MCC and F1M are calculated for each run, and the corresponding results and their average values are listed in Table . As can be seen from this table, we achieve an accuracy (ACC) of 89.65%, with an MCC of 0.776 and an F1M of 84.91%. This result shows that our SVM model performs well on the benchmark dataset DNASet.
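
The fold construction described above can be sketched as follows. It is assumed here that positives and negatives are shuffled separately and dealt round-robin into the folds, so that every fold keeps roughly the class ratio of DNASet (146 DNA-BPs, 250 NBPs); the text does not state whether the original split was stratified in this way. The generator name `five_fold_splits` and the seeds are hypothetical.

```python
import random


def five_fold_splits(n_pos, n_neg, k=5, seed=0):
    """Yield (train, test) lists of ('pos'|'neg', index) pairs for k-fold CV."""
    rng = random.Random(seed)
    pos = [("pos", i) for i in range(n_pos)]
    neg = [("neg", i) for i in range(n_neg)]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(k)]
    for idx, sample in enumerate(pos + neg):
        folds[idx % k].append(sample)  # deal samples round-robin
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


# Three independent runs, each with its own random partition.
runs = [list(five_fold_splits(146, 250, seed=run)) for run in range(3)]
for train, test in runs[0]:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold per run, so pooling the five test predictions gives one prediction per sequence, from which the indices of a run are computed.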

Predictive Performance on Blind Dataset

It is important to examine the performance of the newly developed method on an independent dataset. In this experiment, we establish the classifier with the benchmark dataset DNASet and then test it on the independent dataset DNAiSet. To decide the parameter pair (γ, C), we utilize a systematic grid search for γ and C, parameterized by integers i and j in the ranges [-3, 3] and [0, 3], respectively. The pair that performs best on DNASet is selected. With this best pair (γ, C), DNAiSet is fed to the SVM. As a result, our model correctly predicts 68 out of 82 DNA-BPs and 92 out of 100 NBPs. The ACC reaches 87.91%, with MCC, sensitivity, specificity, and F1M of 0.756, 82.93%, 92.00% and 86.07%, respectively (see Table ). This demonstrates that our SVM model performs comparably well on the independent dataset. For convenience of comparison, results of some existing methods including DNAbinder [1], DNA-Prot [2], iDNA-Prot [3] and enDNA-Prot [4] are also listed in Table .

DNAbinder, developed by Kumar et al. [1], extracts evolutionary information in the form of a position specific scoring matrix (PSSM) from the corresponding protein sequence. PSSM-21 and PSSM-400 are two feature vectors generated by means of PSSM, whose dimensions are 21 and 400, respectively. In [1], the PSSM-400 based SVM model was mainly used for predicting DNA-BPs. DNA-Prot [2] is a Random Forest based method, in which the feature vector includes sequence information and structure information, such as the composition of the 20 standard amino acids, the composition of 10 amino acid groups, and secondary structure information predicted from a protein sequence. iDNA-Prot [3] constructs the feature vector via the grey model, and Random Forest is also used as the operation engine. enDNA-Prot [4] is a predictor which encodes a protein sequence into a feature vector of dimension 188 and adopts an ensemble classifier constructed from four types of machine learning classifiers.
All these methods are tested on the same datasets to allow an unbiased comparison with our method. From Table , we can see that the current approach outperforms the other methods by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. This result indicates that our method achieves highly competitive performance.

Impact of the Number of Negative Samples

When the number of positive samples is comparable to that of negative samples, many machine learning algorithms tend to perform better. However, in practice, the number of non-binding proteins is much greater than that of DNA-BPs, i.e., . (10) In this case, the frequency of NBPs is generally much greater than that of the binding ones in the predictions, that is, . (11) Eqs (10) and (11) imply that the value of ACC defined by Eq (9) tends towards 1. To solve this problem, instead of using the definition of ACC in Eq (9), here we use the alternative definition [49, 50]: . (12)

In order to analyze the influence of the number of negative samples in a benchmark dataset on the predictive performance of the current method, we construct a series of subsets $D_k$ of DNAeSet and use them as the training set in turn, while DNAiSet is always used as the testing set. Each subset contains all 146 DNA-BPs and a part of the NBPs in DNAeSet. In detail, if the set of NBPs in $D_k$ is denoted by $N_k$, k=1, 2, ..., then $N_1$ consists of 250 NBPs randomly selected from DNAeSet, and $N_{k+1}$ is obtained by adding 50 NBPs to $N_k$, until 1700 NBPs are contained in it. For each subset $D_k$, k=1, 2, ..., 30, we develop the SVM model by 5CV with 3 runs. The results averaged over the three runs are given in Fig. (5). From Fig. (5) we can see that the curves of ACC and acc visibly separate from each other as n, the size of $N_k$, grows. With increasing n, ACC increases rapidly, while acc tends to be steady. The value of ACC appears to grow higher and higher, but this is a false appearance that does not correctly reflect the performance.
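
The nested training subsets and the inflation of ACC can be illustrated with a short sketch. Two points are assumptions made only for this illustration: the class-balanced measure is taken to be the mean of sensitivity and specificity, used as a stand-in because Eq. (12) is not legible in this copy, and the fixed rates Se = 0.60 and Sp = 0.98 are hypothetical values, not results from the study. The function names are likewise hypothetical.

```python
import random


def nested_negative_subsets(negatives, start=250, step=50, stop=1700, seed=0):
    """Nested subsets: 250 random negatives, then 50 more each time, up to 1700."""
    order = list(negatives)
    random.Random(seed).shuffle(order)
    return [order[:n] for n in range(start, stop + 1, step)]


def acc_and_balanced(se, sp, n_pos, n_neg):
    """Overall accuracy versus a class-balanced mean of Se and Sp (assumed form)."""
    tp, tn = se * n_pos, sp * n_neg
    acc = (tp + tn) / (n_pos + n_neg)
    balanced = (se + sp) / 2
    return acc, balanced


subsets = nested_negative_subsets(range(1710))  # indices of the 1710 NBPs
print(len(subsets), len(subsets[0]), len(subsets[-1]))

# Hypothetical fixed rates: the classifier itself does not change, only n does.
curve = [acc_and_balanced(0.60, 0.98, 146, len(s)) for s in subsets]
print(curve[0], curve[-1])
```

Even with the per-class rates held fixed, the overall accuracy climbs as negatives are added because the majority class dominates the ratio, while the balanced measure stays constant; this mirrors the divergence of the two curves described above.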

Fig. (5)

The influence of the number of negative samples.

In order to show the advantage of their method, Xu et al. [4] created an expanded benchmark dataset with all 146 positive samples and 1100 negative samples from DNAeSet (n=1100), which is employed as another training dataset to evaluate the predictive performance on the independent dataset DNAiSet. For convenience of comparison, we also select this expanded benchmark dataset to establish the classifier and test it on DNAiSet. Repeating this procedure five times, the average results are given in Table (the first row). Results obtained by the other four methods (DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot) trained on the expanded benchmark dataset with n=1100 are also listed in Table . From this table we see that the overall accuracy of our method is about 92%, with an MCC of 0.84 and an F1M of 91.24%, which outperforms the other methods with improvements in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. This suggests that our method performs well on unbalanced datasets.

Conclusion

Based on two important physicochemical properties, the 20 standard amino acids were distributed into three groups, each of which was assigned a representative symbol. By replacing each amino acid with its representative letter, a protein primary sequence was converted into a three-letter sequence, which can be viewed as a coarse-grained description of the protein primary sequence. On the basis of the three-letter sequence, a graph without loops and multiple edges was obtained. Taking advantage of the 2-D graph, we constructed a geometric line adjacency matrix (GLAM), and then calculated the corresponding ALE-index, the line-adjacency index, the first Zagreb index and its modification. In addition, order-correlated factors were extracted from the reduced sequence. By combining these elements with the frequencies of occurrence of the 20 standard amino acids and their three representative letters, a generalized PseAAC model of a protein sequence was constructed. On five popular datasets, the proposed method was tested by phylogenetic analysis and identification of DNA-binding proteins. The results illustrated the better performance of our method.

Consent for Publication

Not applicable.

Table 1

The accession number, name and abbreviation for 72 coronavirus spike proteins.

No.	Accession number	Virus name/strain	Abbreviation
1.	CAB91145	Transmissible gastroenteritis virus, genomic RNA	TGEVG
2.	NP_058424	Transmissible gastroenteritis virus	TGEV
3.	AAK38656	Porcine epidemic diarrhea virus strain CV777	PEDVC
4.	NP_598310	Porcine epidemic diarrhea virus	PEDV
5.	BAL45637	Human coronavirus 229E	HCoV-229E
6.	AAP92675	Avian infectious bronchitis virus isolate BJ	IBVBJ
7.	AAS00080	Avian infectious bronchitis virus strain Ca199	IBVC
8.	NP_040831	Avian infectious bronchitis virus	IBV
9.	NP_937950	Human coronavirus OC43	HCoV-OC43
10.	AAK83356	Bovine coronavirus isolate BCoV-ENT	BCoVE
11.	AAL57308	Bovine coronavirus isolate BCoV-LUN	BCoVL
12.	AAA66399	Bovine coronavirus strain Mebus	BCoVM
13.	AAL40400	Bovine coronavirus strain Quebec	BCoVQ
14.	NP_150077	Bovine coronavirus	BCoV
15.	AAB86819	Mouse hepatitis virus strain MHV-A59C12 mutant	MHVA
16.	YP_209233	Murine hepatitis virus strain JHM	MHVJHM
17.	AAF69334	Mouse hepatitis virus strain Penn 97-1	MHVP
18.	AAF69344	Mouse hepatitis virus strain ML-10	MHVM
19.	NP_045300	Mouse hepatitis virus	MHV
20.	AAU04646	SARS coronavirus civet007	civet007
21.	AAU04649	SARS coronavirus civet010	civet010
22.	AAU04664	SARS coronavirus civet020	civet020
23.	AAV91631	SARS coronavirus A022	A022
24.	AAV49730	SARS coronavirus B039	B039
25.	AAP51227	SARS coronavirus GD01	GD01
26.	AAS00003	SARS coronavirus GZ02	GZ02
27.	AAP30030	SARS coronavirus BJ01	BJ01
28.	AAP13567	SARS coronavirus CUHK-W1	CUHK-W1
29.	AAP37017	SARS coronavirus TW1	TW1
30.	AAR87523	SARS coronavirus TW2	TW2
31.	BAC81348	SARS coronavirus TWH genomic RNA	TWH
32.	BAC81362	SARS coronavirus TWJ genomic RNA	TWJ
33.	AAQ01597	SARS coronavirus Taiwan TC1	TaiwanTC1
34.	AAQ01609	SARS coronavirus Taiwan TC2	TaiwanTC2
35.	AAP97882	SARS coronavirus Taiwan TC3	TaiwanTC3
36.	AAP13441	SARS coronavirus Urbani	Urbani
37.	AAP72986	SARS coronavirus HSR 1	HSR1
38.	AAQ94060	SARS coronavirus AS	AS
39.	AAP94737	SARS coronavirus CUHK-AG01	CUHK-AG01
40.	AAP94748	SARS coronavirus CUHK-AG02	CUHK-AG02
41.	AAP94759	SARS coronavirus CUHK-AG03	CUHK-AG03
42.	AAP30713	SARS coronavirus CUHK-Su10	CUHK-Su10
43.	AAP33697	SARS coronavirus Frankfurt 1	Frankfurt1
44.	AAR14803	SARS coronavirus PUMC01	PUMC01
45.	AAR14807	SARS coronavirus PUMC02	PUMC02
46.	AAR14811	SARS coronavirus PUMC03	PUMC03
47.	AAP41037	SARS coronavirus TOR2	TOR2
48.	AAP50485	SARS coronavirus FRA	FRA
49.	AAR23250	SARS coronavirus Sin01-11	Sino1-11
50.	AHX00731	MERS coronavirus	KFU-HKU1
51.	AHX00711	MERS coronavirus	KFU-HKU13
52.	AHX00721	MERS coronavirus	KFU-HKU19Dam
53.	AIY60578	MERS coronavirus	Abu-Dhabi_UAE_9
54.	AIY60568	MERS coronavirus	Abu-Dhabi_UAE_33
55.	AIZ74417	MERS coronavirus	Hu-France(UAE)-FRA1
56.	AIZ74433	MERS coronavirus	Hu-France-FRA2
57.	ALJ54502	MERS coronavirus	Hu/Qunfidhah-KSA-Rs1338
58.	AKN24821	MERS coronavirus	KFMC-1
59.	AKN24830	MERS coronavirus	KFMC-7
60.	ALJ76282	MERS coronavirus	Hu/Taif,KSA-2083
61.	ALJ76281	MERS coronavirus	Hu/Taif,KSA-5920
62.	ALJ54493	MERS coronavirus	Hu/Makkah-KSA-728
63.	ALB08267	MERS coronavirus	KOREA/Seoul/014-1
64.	ALB08278	MERS coronavirus	KOREA/Seoul/014-2
65.	ALR69641	MERS coronavirus	D2731.3
66.	AKQ21055	MERS coronavirus	ADFCA-HKU1
67.	AKQ21064	MERS coronavirus	ADFCA-HKU2
68.	AKQ21073	MERS coronavirus	ADFCA-HKU3
69.	ALA50001	MERS coronavirus	camel/Taif/T68
70.	ALA50012	MERS coronavirus	camel/Taif/T89
71.	ALT66813	MERS coronavirus	Jordan_1
72.	ALT66802	MERS coronavirus	Jordan_10

Table 2

The original numerical values for properties of the 20 standard amino acids.

Amino acid (AA)	pI^a (P1^0)	RD^a (P2^0)
A	6.02	1889
C	5.02	3355
D	2.97	2209
E	3.22	1812
F	5.48	1916
G	5.97	2078
H	7.59	1507
I	6.02	1765
K	9.74	1797
L	5.98	1822
M	5.75	1689
N	5.42	1943
P	6.30	1720
Q	5.65	1538
R	10.76	1697
S	5.68	2000
T	6.53	1469
V	5.97	1680
W	5.89	2317
Y	5.66	1787

a: taken from [26-28]

Table 3

The scaled values for properties of the 20 standard amino acids.

AA	P1*	Label 1	P2*	Label 2
A	0.3915	-	0.2227	-
C	0.2632	-	1.0000	+
D	0	-	0.3924	+
E	0.0321	-	0.1819	-
F	0.3222	-	0.2370	+
G	0.3851	-	0.3229	+
H	0.5931	+	0.0201	-
I	0.3915	-	0.1569	-
K	0.8691	+	0.1739	-
L	0.3864	-	0.1872	-
M	0.3569	-	0.1166	-
N	0.3145	-	0.2513	+
P	0.4275	+	0.1331	-
Q	0.3440	-	0.0366	-
R	1.0000	+	0.1209	-
S	0.3479	-	0.2815	+
T	0.4570	+	0	-
V	0.3851	-	0.1119	-
W	0.3748	-	0.4496	+
Y	0.3453	-	0.1686	-
Mean (P̄n)	0.3994		0.2283
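The scaled values and labels in Table 3 are consistent with min-max scaling of the Table 2 values followed by a comparison with the column mean. The sketch below reproduces them under that assumption; the rule "+ if the scaled value is at least the mean, otherwise -" is inferred from the table rather than quoted from the text, and the function names are hypothetical.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

# Original values from Table 2, in the order of AA.
PI = [6.02, 5.02, 2.97, 3.22, 5.48, 5.97, 7.59, 6.02, 9.74, 5.98,
      5.75, 5.42, 6.30, 5.65, 10.76, 5.68, 6.53, 5.97, 5.89, 5.66]
RD = [1889, 3355, 2209, 1812, 1916, 2078, 1507, 1765, 1797, 1822,
      1689, 1943, 1720, 1538, 1697, 2000, 1469, 1680, 2317, 1787]


def min_max(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def sign_labels(scaled):
    """Assumed rule: '+' if a value is at least the column mean, else '-'."""
    mean = sum(scaled) / len(scaled)
    return "".join("+" if s >= mean else "-" for s in scaled)


p1_star = min_max(PI)
p2_star = min_max(RD)
label1 = sign_labels(p1_star)
label2 = sign_labels(p2_star)
mean_p1 = sum(p1_star) / len(p1_star)
mean_p2 = sum(p2_star) / len(p2_star)
```

For example, alanine gives (6.02 - 2.97) / (10.76 - 2.97) ≈ 0.3915 for P1* and (1889 - 1469) / (3355 - 1469) ≈ 0.2227 for P2*, matching the first row of Table 3.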

Table 4

The values for properties of the three groups.

Group	Representative	P1'	P2'	P1	P2
G_I	A	0.3291	0.1478	0.9122	0.4097
G_II	C	0.2868	0.4193	0.5646	0.8253
G_III	H	0.6693	0.0896	0.9912	0.1327
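The P1' and P2' columns of Table 4 can be reproduced, to within rounding, by grouping the amino acids according to their pair of labels in Table 3 and averaging the scaled values within each group. This grouping rule is an assumption inferred from the tables; the sketch does not attempt to reproduce the P1 and P2 columns, whose derivation is not given in this section.

```python
from collections import defaultdict

AA = "ACDEFGHIKLMNPQRSTVWY"

# Scaled values and labels copied from Table 3, in the order of AA.
P1_STAR = [0.3915, 0.2632, 0.0, 0.0321, 0.3222, 0.3851, 0.5931, 0.3915,
           0.8691, 0.3864, 0.3569, 0.3145, 0.4275, 0.3440, 1.0, 0.3479,
           0.4570, 0.3851, 0.3748, 0.3453]
P2_STAR = [0.2227, 1.0, 0.3924, 0.1819, 0.2370, 0.3229, 0.0201, 0.1569,
           0.1739, 0.1872, 0.1166, 0.2513, 0.1331, 0.0366, 0.1209, 0.2815,
           0.0, 0.1119, 0.4496, 0.1686]
LABEL1 = "------+-+---+-+-+---"
LABEL2 = "-++-++-----+---+--+-"

# Assumed rule: residues sharing the same (label 1, label 2) pair form a group.
groups = defaultdict(list)
for i in range(len(AA)):
    groups[(LABEL1[i], LABEL2[i])].append(i)


def group_means(indices):
    """Mean P1* and mean P2* over the residues at the given indices."""
    n = len(indices)
    return (sum(P1_STAR[i] for i in indices) / n,
            sum(P2_STAR[i] for i in indices) / n)


# Keys are the member residues of each group, e.g. "HKPRT".
summary = {"".join(AA[i] for i in idx): group_means(idx) for idx in groups.values()}
```

Only three of the four possible label pairs occur, which agrees with the three groups represented by A, C and H.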

Table 5

The results of 5CV for 3 runs.

Test	1	2	3	Average
Se(%)	78.77	78.77	79.45	79.00
Sp(%)	96.00	96.00	95.60	95.87
Acc(%)	89.65	89.65	89.65	89.65
MCC	0.7761	0.7761	0.7758	0.776
F1M(%)	84.87	84.87	84.98	84.91
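The five measures in Table 5 can be computed from a confusion matrix with the standard definitions, assuming that F1M denotes the usual F1 measure. In the sketch below the counts are an assumption: they were back-calculated so that they reproduce run 1 of Table 5 and are not quoted from the source.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, MCC and F1 from confusion counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"Se": se, "Sp": sp, "Acc": acc, "MCC": mcc, "F1M": f1}


# Assumed counts (146 positives, 250 negatives), chosen to match run 1.
run1 = binary_metrics(tp=115, fn=31, tn=240, fp=10)
```

With these counts the sensitivity is 115/146 ≈ 78.77% and the specificity is 240/250 = 96.00%, in line with the first column of Table 5.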

Table 6

Performance of different methods (trained on DNASet and tested on DNAiSet).

Method	ACC(%)	MCC	F1M(%)	Se(%)	Sp(%)
This work	87.91	0.756	86.07	82.93	92.00
DNAbinder(PSSM-21)	79.00	0.61	70.31	54.87	98.08
DNAbinder(PSSM-400)	80.11	0.62	72.73	58.53	97.97
DNA-Prot	84.61	0.69	81.08	73.17	94.00
iDNA-Prot	77.47	0.55	75.73	78.05	77.00
enDNA-Prot	84.62	0.70	84.62	73.18	94.00

Table 7

Performance of different methods (trained on DNAeSet and tested on DNAiSet).

Method	ACC(%)	MCC	F1M(%)
This work	92.05	0.84	91.24
DNAbinder(PSSM-21)	72.93	0.52	57.39
DNAbinder(PSSM-400)	78.45	0.61	68.80
DNA-Prot	76.37	0.58	64.46
iDNA-Prot	76.92	0.58	66.13
enDNA-Prot	89.56	0.79	87.42
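The improvement ranges quoted in the abstract can be checked against Tables 6 and 7 by subtracting each baseline from the row for this work. The sketch below is only a consistency check; the helper name `margins` is hypothetical.

```python
# (ACC %, MCC, F1M %) per method, copied from Tables 6 and 7.
TABLE6 = {
    "This work": (87.91, 0.756, 86.07),
    "DNAbinder(PSSM-21)": (79.00, 0.61, 70.31),
    "DNAbinder(PSSM-400)": (80.11, 0.62, 72.73),
    "DNA-Prot": (84.61, 0.69, 81.08),
    "iDNA-Prot": (77.47, 0.55, 75.73),
    "enDNA-Prot": (84.62, 0.70, 84.62),
}
TABLE7 = {
    "This work": (92.05, 0.84, 91.24),
    "DNAbinder(PSSM-21)": (72.93, 0.52, 57.39),
    "DNAbinder(PSSM-400)": (78.45, 0.61, 68.80),
    "DNA-Prot": (76.37, 0.58, 64.46),
    "iDNA-Prot": (76.92, 0.58, 66.13),
    "enDNA-Prot": (89.56, 0.79, 87.42),
}


def margins(table, ours="This work"):
    """Smallest and largest lead of `ours` over the other rows, per metric."""
    best = table[ours]
    others = [row for name, row in table.items() if name != ours]
    result = []
    for j in range(len(best)):
        diffs = [best[j] - row[j] for row in others]
        result.append((round(min(diffs), 3), round(max(diffs), 3)))
    return result


margins6 = margins(TABLE6)  # [(ACC min, ACC max), (MCC ...), (F1M ...)]
margins7 = margins(TABLE7)
```

The computed margins agree with the ranges reported in the abstract: 3.29-10.44 (ACC), 0.056-0.206 (MCC) and 1.45-15.76 (F1M) for Table 6, and 2.49-19.12, 0.05-0.32 and 3.82-33.85 for Table 7.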

38 in total

1. A computational approach to simplifying the protein folding alphabet.

Authors: J Wang; W Wang
Journal: Nat Struct Biol Date: 1999-11

2. Modeling study on the validity of a possibly simplified representation of proteins.

Authors: J Wang; W Wang
Journal: Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics Date: 2000-06

3. Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve.

Authors: C T Zhang; J Wang
Journal: Nucleic Acids Res Date: 2000-07-15 Impact factor: 16.971

4. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition.

Authors: Hao Lin; En-Ze Deng; Hui Ding; Wei Chen; Kuo-Chen Chou
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971