A novel model for protein sequence similarity analysis based on spectral radius.

Chuanyan Wu¹, Rui Gao², Yang De Marinis³, Yusen Zhang⁴.

Abstract

Advances in sequencing technologies have led to a rapid increase in the number and diversity of biological sequences, which has facilitated the development of sequence research. In this paper, we present a new method for analyzing protein sequence similarity. We calculated the spectral radii of 20 amino acids (AAs) and put forward a novel 2-D graphical representation of protein sequences. To characterize protein sequences numerically, three groups of features were extracted, related to the statistical measurements, dynamics measurements and fluctuation complexity of the sequences. With the obtained feature vector, two models utilizing Gaussian Kernel similarity and Cosine similarity were built to measure the similarity between sequences. We applied our method to analyze the similarities/dissimilarities of four data sets. Both proposed models produced results that were consistent with those obtained by the ClustalW analysis, with improvements. The novel approach we present in this study may therefore benefit protein research in medical and scientific fields.

Keywords: Fluctuation complexity; Functional group; Protein sequence similarity analysis; Protein vector

Substances：
Proteins

Year: 2018 PMID： 29524440 PMCID： PMC7094169 DOI： 10.1016/j.jtbi.2018.03.001

Source DB: PubMed Journal: J Theor Biol ISSN： 0022-5193 Impact factor: 2.691

Introduction

With the development of sequencing techniques, the number of known biological sequences is increasing fast. Effective extraction and analysis of biological information from large databases has drawn much attention in the field of bioinformatics. Sequence similarity and evolutionary relationship analysis, which is used to infer the function of unknown sequences (Louie et al., 2009), may shed light on the identification of potential drug targets and give insights into the underlying molecular mechanisms of diseases (Jiang and Zhou, 2005). For protein sequence similarity analysis, there are several commonly applied methods, which can be divided into two groups: alignment-based methods (Chakraborty, Bandyopadhyay, 2012, Gotoh, 1982, Liu, Yang, Wang, Yao, Dai, 2015) and alignment-free methods. In the alignment-based methods, a sequence alignment scoring matrix and gap penalty parameters are used to represent insertion, deletion or substitution of AAs in the compared sequences. Nevertheless, because alignment-based approaches are generally memory demanding and time consuming, many alignment-free methods (Yu et al., 2017) are applied as an alternative; these characterize protein sequences numerically by extracting invariants from the sequences. According to how the protein sequence is represented, there are mainly two types of alignment-free methods: (1) digital-signal-based representation and (2) graphical representation. The digital-signal-based representation encodes a single amino acid (AA) into a number so that a protein sequence is converted into a digital signal sequence, which is processed by digital signal analysis tools to extract the features of the protein sequence. For example, in a study performed by Hou et al. (2017), protein sequences were converted into numerical sequences according to their physicochemical properties, and the power spectra were obtained by Discrete Fourier Transform (DFT).
Furthermore, Dynamic Time Warping (DTW) was used to extend the spectra to the same length in order to calculate the distance between different sequences. Su and Bao (2013) proposed a method based on Discrete Wavelet Transform (DWT) to measure protein sequence similarity. The model employed only the approximation coefficients of the DWT, so that the feature vector was short enough to bring a great improvement in running time. Graphical representation has been widely explored in bioinformatics research (Czerniecka, Bielińska-Wąż, Wąż, Clark, 2016, Yao, Yan, Han, Dai, He, 2014). It represents a protein sequence graphically and then extracts a feature vector from the graph. Various approaches to graphical representation were proposed according to the physicochemical properties of the AAs. Some methods converted an AA to a discrete point according to its physicochemical properties (Xu et al., 2014), and some methods mapped an AA to a unique value by principal components of physicochemical properties (Wang et al., 2014). Zhang et al. (2015) put forward a 2-D graphical representation by converting each AA to a point according to the hydrophobicity and hydropathy indexes. Then the cumulative distance of every point was utilized to represent the distance between the sequences. In addition, similar approaches have been proposed by several studies (Li, Geng, He, Yao, 2014, Qi, Jin, 2016). Furthermore, several graphical representations of protein sequences have applied reduced protein models (Li, Yu, Yang, Zheng, Wang, 2009, Randić, Vračko, Novič, Plavšić, 2009). Yao et al. (2014c) simplified twenty AAs into four types with preset values according to hydrophobicity. Four consecutive numbers were summed as the amplitudes of the vertical axis. Thus, a protein sequence can be characterized by a 17-D vector containing the frequencies of the amplitudes.
Based on the idea of cyclic order of 20 AAs, Ellakkani and Mahran (2015) selected twenty concentric evenly spaced circles divided by n radial lines into equal divisions to represent any protein sequence of length n. The mean of each two successive distances between each two successive AAs was calculated. The set of the different mean distances with their frequencies was taken as a mathematical descriptor. The graph-energy-based methods are also effective graphical representations (Sun et al., 2016). Wu et al. (2015) calculated the graph energy and Laplacian energy of 20 AAs from the codons of the AAs, and applied them to a novel 2-D graphical representation of proteins to analyze the similarity/dissimilarity of protein sequences. Despite these achievements, there has been relatively little research on the dynamic and non-linear features of protein sequences. In this study, we calculated the spectral radii of 20 AAs and applied the obtained spectral radii to a novel 2-D graphical representation. The 2-D graph was characterized mathematically by extracting three groups of features. The static features of the protein sequence included the mean of spectral radii (MSR), distribution of spectral radii (DSR) and distribution of functional groups (DFG). The dynamics features of the protein sequence included distribution of spectral radii transitions (DSRT) and distribution of functional groups transitions (DFGT). The non-linear feature was fluctuation complexity (C). With the mathematical characterization, two models adopting Gaussian Kernel similarity and Cosine similarity were built to analyze the similarities among nine NADH5 (ND5), thirty-five Coronavirus Spike Proteins (CoVPs), twenty-four transferrin proteins (TFs) and twenty-seven antifreeze protein sequences (AFPs). Our results were consistent with those achieved by the ClustalW, with improvements.
The simulations showed that the graphical representation represented the sequence visually and comprehensively, and the proposed method was effective for protein sequence similarity analysis.

Materials and methods

Spectral radius of graph

For a graph, there are many measurements, such as graph energy (Qi, Wu, Zhang, Fuller, Zhang, 2011, Wu, Zhang, Chen, Mu, 2015, Yu, Zhang, Gutman, Shi, Dehmer, 2017), Laplacian graph energy (Wu et al., 2015), spectral radius, point centrality (Zhou et al., 2016), average degree of nodes (Zhou et al., 2016) etc. They have been applied in sequence analysis successfully. For example, a weighted directed graph was set up for each DNA sequence. The adjacency matrix of the directed graph was used to induce a representative vector for a DNA sequence (Qi et al., 2011). The spectral radius of a graph is the largest eigenvalue of the adjacency matrix of the graph. It has been widely used as a metric in classification. In the model called PROTNN, a rich set of structural and topological attributes including spectral radius were extracted to classify protein structures (Dhifli and Diallo, 2016). A metric called the spectral radius ratio was defined as the ratio of the spectral radius to the average node degree in order to measure the variation in node degree for complex network graphs (Meghanathan, 2014). All this previous research indicated that spectral radius was an effective metric. Thus, the spectral radius was adopted to model the protein sequence. Let $G = (V, E)$ be a graph possessing n vertices and m edges, with the set of vertices $V = \{v_1, v_2, \dots, v_n\}$ and the set of edges $E = \{e_1, e_2, \dots, e_m\}$. An adjacency matrix $A(G)$ of G was defined, where $\lambda_1, \lambda_2, \dots, \lambda_n$ were set to be the eigenvalues of the adjacency matrix A(G). Spectral radius ρ(A) (Yu et al., 2004) was defined as

$$\rho(A) = \max_{1 \le i \le n} |\lambda_i| \tag{2}$$
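
As an illustration of the spectral radius of a weighted, undirected graph, the following is a minimal Python sketch. It is not the authors' implementation: the function name `spectral_radius`, the shifted power iteration and the two toy graphs are assumptions chosen for the example, and an eigenvalue routine from a linear algebra library could be used instead.

```python
import math

def spectral_radius(adj, iters=10000, tol=1e-13):
    """Estimate the largest eigenvalue of a symmetric, non-negative matrix.

    Hypothetical helper: power iteration is run on (A + I) rather than A,
    so that bipartite graphs (eigenvalues +rho and -rho) do not oscillate.
    """
    n = len(adj)
    v = [1.0 / math.sqrt(n)] * n          # positive start vector
    rho = 0.0
    for _ in range(iters):
        # w = (A + I) v
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        # The Rayleigh quotient v^T A v approximates the top eigenvalue of A.
        av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        new_rho = sum(v[i] * av[i] for i in range(n))
        if abs(new_rho - rho) < tol:
            return new_rho
        rho = new_rho
    return rho

path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # unweighted path on three vertices
edge3 = [[0, 3], [3, 0]]                    # a single edge with value 3
print(round(spectral_radius(path3), 4))     # 1.4142
print(round(spectral_radius(edge3), 4))     # 3.0
```

The second toy graph shows how an edge value enters the result: a single edge walked three times gives a spectral radius of 3.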

The spectral radii of 20 AAs

The method used to obtain the spectral radii of 20 AAs is briefly described as follows. It is based on the graphs of 20 AAs introduced by Wu et al. (2015). Four nucleotides, i.e., A, G, C and T, were mapped to four unit vectors with different directions, respectively. An AA was mapped to a graph by a walking method that connects the codons of the specified AA. The walker began to walk from (0, 0). If the following nucleotide in the codons was the same as the current one, the walker would not change the walking direction and would only increase the value of the edge by one. Otherwise, the walker would change the walking direction according to the direction of the nucleotide. Thus, the nucleotides were connected together to form a graph. Then the graph was transformed to an undirected graph. In the graph, the value of an edge denoted the number of times it was walked. The graphs of 20 AAs are shown in Figure S1 in supplementary materials. After getting the graphs of 20 AAs, the adjacency matrices of the graphs were established to calculate the eigenvalues. Thus, the spectral radii of 20 AAs were calculated according to (2) and the result is shown in Table 1.

Table 1

The spectral radii and type numbers of 20 AAs.

AA	SR	Type	AA	SR	Type	AA	SR	Type	AA	SR	Type
A	6.04	1	G	8.13	6	M	2.24	11	S	5.19	16
C	2.5	2	H	2.93	7	N	3.26	12	T	3.15	17
D	3.02	3	I	5.15	8	P	8.13	13	V	3.61	18
E	2.46	4	K	5.1	9	Q	2.46	14	W	2.24	19
F	3.26	5	L	7.42	10	R	8.32	15	Y	3.26	20

The 2-D graphical representation of protein sequence

Suppose a protein sequence is denoted by $S = s_1 s_2 \dots s_N$, where $s_i$ denotes the ith AA along the protein sequence. In order to represent the order and types of AAs in the sequence, the x-coordinate value includes two parts. One part is the ordinal number of the ith AA appearing in the sequence. The other part describes the types of the 20 AAs. Type numbers of the AAs were defined according to the alphabetic order to distinguish the 20 AAs, as shown in Table 1. For graphical representation of the ith AA, we defined a point in terms of $T_i$ and $sr_i$, where $T_i$ denoted the type of the ith AA defined in Table 1 and $sr_i$ denoted the spectral radius of the ith AA along the sequence. Our method was performed on two short fragments of Saccharomyces cerevisiae to demonstrate the proposed graphical representation. Protein I (PI) and protein II (PII) sequences are PI:WTFESRNDPAKDPVILWLNGGPGCSSLTGL, PII:WFFESRNDPANDPIILWLNGGPGCSSFTGL. The 2-D graphical representation of PI and PII is shown in Fig. 1. Fig. 1 shows intuitively that the two sequences differ at four points, which is consistent with the result of manual alignment.
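
The comparison of PI and PII can be reproduced with a short Python sketch. It attaches to each residue its ordinal number, type number and spectral radius from Table 1 and lists the positions at which the two fragments differ. The exact point coordinates used in Fig. 1 are not reproduced here, and the names `SR`, `TYPE`, `describe` and `differing_positions` are illustrative assumptions.

```python
# Spectral radius (SR) of each AA, copied from Table 1.
SR = {
    "A": 6.04, "C": 2.5, "D": 3.02, "E": 2.46, "F": 3.26,
    "G": 8.13, "H": 2.93, "I": 5.15, "K": 5.1, "L": 7.42,
    "M": 2.24, "N": 3.26, "P": 8.13, "Q": 2.46, "R": 8.32,
    "S": 5.19, "T": 3.15, "V": 3.61, "W": 2.24, "Y": 3.26,
}
# Type numbers follow the alphabetic order of the one-letter codes (Table 1).
TYPE = {aa: k for k, aa in enumerate(sorted(SR), start=1)}

def describe(seq):
    """Return (ordinal number, type number, spectral radius) per residue."""
    return [(i, TYPE[aa], SR[aa]) for i, aa in enumerate(seq, start=1)]

def differing_positions(seq_a, seq_b):
    """1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]

PI = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
PII = "WFFESRNDPANDPIILWLNGGPGCSSFTGL"
print(describe(PI)[:3])              # [(1, 19, 2.24), (2, 17, 3.15), (3, 5, 3.26)]
print(differing_positions(PI, PII))  # [2, 11, 14, 27]
```

The four positions returned correspond to the four differing points between the two fragments.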

Fig. 1

The 2-D graphical representation of protein I (PI) and protein II (PII) sequences. PI and PII sequences are PI:WTFESRNDPAKDPVILWLNGGPGCSSLTGL, PII:WFFESRNDPANDPIILWLNGGPGCSSFTGL.

Numerical characterization of protein sequence

Mean spectral radius (MSR)

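To extract the mean value of spectral radii of the protein sequence, we defined

$$\mathrm{MSR} = \frac{1}{N} \sum_{i=1}^{N} sr_i \tag{4}$$

where $sr_i$ denoted the spectral radius of the ith AA in the protein sequence and N denoted the length of the sequence.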
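
As a small worked example, consider only the first three residues of PI (W, T and F) with the spectral radii from Table 1, and assume the mean is taken over all residues of this three-residue fragment:

```latex
\mathrm{MSR}(\mathrm{WTF}) = \frac{sr_W + sr_T + sr_F}{3}
                           = \frac{2.24 + 3.15 + 3.26}{3}
                           = \frac{8.65}{3} \approx 2.88
```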

Distribution of spectral radii (DSR)

According to the size and clustering of the spectral radii of 20 AAs, the values of spectral radii were classified into eight intervals: Interval 1 = {2 ≤ sr ≤ 2.5}, Interval 2 = {2.5 < sr ≤ 3.15}, Interval 3 = {3.15 < sr ≤ 5}, Interval 4 = {5 < sr ≤ 6}, Interval 5 = {6 < sr ≤ 7}, Interval 6 = {7 < sr ≤ 8}, Interval 7 = {8 < sr ≤ 8.3} and Interval 8 = {8.3 < sr ≤ 9}. Each AA of the sequence was assigned to the interval containing its spectral radius, and the distribution F of the spectral radius intervals was defined over these eight intervals.
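
The interval assignment can be sketched in Python as follows. The interval bounds are those listed in the text; treating the distribution as relative frequencies (counts divided by the sequence length) is an assumption, as are the names `UPPER_BOUNDS`, `interval_index` and `dsr`.

```python
from bisect import bisect_left

# Upper bounds of Intervals 1-8 as listed in the text. Every interval is
# closed on the right; Interval 1 is also closed on the left at 2.
UPPER_BOUNDS = [2.5, 3.15, 5, 6, 7, 8, 8.3, 9]

def interval_index(sr):
    """Return the 1-based number of the interval containing a spectral radius."""
    if not 2 <= sr <= 9:
        raise ValueError("spectral radius outside the range covered by the intervals")
    return bisect_left(UPPER_BOUNDS, sr) + 1

def dsr(sr_values):
    """Relative frequency of each interval (the normalisation is an assumption)."""
    counts = [0] * len(UPPER_BOUNDS)
    for sr in sr_values:
        counts[interval_index(sr) - 1] += 1
    return [c / len(sr_values) for c in counts]

# Spectral radii of W, W, R and G from Table 1.
print(dsr([2.24, 2.24, 8.32, 8.13]))  # [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25]
```

With the values of Table 1, every one of the eight intervals contains at least one AA; for example, R (8.32) is the only AA in Interval 8.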

Distribution of spectral radii transitions (DSRT)

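To obtain the distribution of spectral radius interval transitions in a sequence, we defined a matrix A describing the transitions between the intervals of successive AAs along the sequence, which was then rebuilt into a row vector.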
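
A transition distribution of this kind can be sketched as below. This is a sketch under stated assumptions rather than the paper's definition: the helper name `transition_vector`, the row-by-row flattening order and the division by the number of transitions are all hypothetical choices.

```python
def transition_vector(states, n_states):
    """Flatten the n_states x n_states transition-count matrix row by row.

    `states` holds 1-based state numbers (for example interval numbers) in
    sequence order. Dividing by the number of transitions is an assumption.
    """
    matrix = [[0] * n_states for _ in range(n_states)]
    for cur, nxt in zip(states, states[1:]):
        matrix[cur - 1][nxt - 1] += 1
    total = max(len(states) - 1, 1)
    return [count / total for row in matrix for count in row]

# Interval numbers of a four-residue toy sequence; 8 intervals give 64 entries.
vec = transition_vector([1, 3, 3, 8], 8)
print(len(vec))  # 64
```

The same helper would apply to any sequence of discrete states, including a sequence recoded by functional group.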

Distribution of functional groups (DFG)

Physicochemical properties of AAs are largely related to the side chain of AAs. Each property of AAs has its particularity, which depends on the type of the side chain the AAs possess (Hayat and Khan, 2013). By the presence of side chain chemical group, 20 AAs were classified into 10 functional groups: phenyl (F/W/Y), carboxyl (D/E), imidazole (H), primary amine (K), guanidino (R), thiol (C), sulfur (M), amido (Q/N), hydroxyl (S/T) and nonpolar (A/G/I/L/V/P) (Pugalenthi et al., 2008). Let F, D, H, K, R, C, M, Q, S and A represent each group, respectively. Thus, the sequence can be represented as a string over these ten letters, from which the distribution G of the ten functional groups was defined.
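
The recoding by functional group can be sketched in Python as follows. The group memberships and representative letters are taken from the text; the names `GROUPS`, `AA_TO_GROUP`, `to_groups` and `dfg`, and the use of relative frequencies for the distribution, are assumptions made for the example.

```python
# Functional groups from the text; each key is the letter representing the group.
GROUPS = {
    "F": "FWY",      # phenyl
    "D": "DE",       # carboxyl
    "H": "H",        # imidazole
    "K": "K",        # primary amine
    "R": "R",        # guanidino
    "C": "C",        # thiol
    "M": "M",        # sulfur
    "Q": "QN",       # amido
    "S": "ST",       # hydroxyl
    "A": "AGILVP",   # nonpolar
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}

def to_groups(seq):
    """Rewrite a protein sequence over the ten functional-group letters."""
    return "".join(AA_TO_GROUP[aa] for aa in seq)

def dfg(seq):
    """Relative frequency of each group in GROUPS order (normalisation assumed)."""
    recoded = to_groups(seq)
    return [recoded.count(g) / len(recoded) for g in GROUPS]

print(to_groups("WTFESRNDPA"))  # FSFDSRQDAA
```

The input here is the first ten residues of PI.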

Distribution of functional groups transitions (DFGT)

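To get the distribution of functional group transitions, we defined a matrix B describing the transitions between the functional groups of successive AAs along the sequence, which was then rebuilt into a row vector.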

Fluctuation complexity (C)

Fluctuation complexity (Grassberger, 1986) can be applied for classification and has been widely used to describe symbol sequences in information sciences (Parrott, 2010); it was defined by Bates and Shepard (Grassberger, 1986). It is well known that the function of proteins varies according to the type and order of AA residues. Fluctuation complexity considers both the probability of a single AA and the transition probability. It reflects the fluctuation in net information gain of the sequence. Thus, fluctuation complexity was adopted to characterize the protein sequence. Fluctuation complexity was defined in terms of L, $P_i$ and $P_{ij}$, where L denoted the number of states existing in the sequence, which was equal to the number of AA types (20 in this paper), $P_i$, calculated by (10), denoted the probability of the ith state in a sequence, and $P_{ij}$, calculated by (11), denoted the transition probability of state i followed by state j in a sequence. $P_i$ and $P_{ij}$ were defined from the counts $C_i$ and $C_{ij}$, where $C_i$ denoted the number of occurrences of the ith AA in the sequence and $C_{ij}$ denoted the number of times the jth AA followed the ith AA in the sequence.
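
A fluctuation-complexity calculation can be sketched as follows. This is a sketch under explicit assumptions, not the paper's equations: it uses the form $\sum_{i,j} P_{ij} (\log_2 (P_i / P_j))^2$, takes $P_i = C_i / N$ and $P_{ij} = C_{ij} / (N - 1)$ for a sequence of length N, and uses base-2 logarithms. The function name `fluctuation_complexity` is hypothetical.

```python
import math
from collections import Counter

def fluctuation_complexity(seq):
    """Sum over observed transitions i -> j of P_ij * log2(P_i / P_j) ** 2.

    Assumed normalisations: P_i = C_i / N and P_ij = C_ij / (N - 1).
    """
    n = len(seq)
    if n < 2:
        return 0.0
    single = Counter(seq)                  # C_i: occurrences of each symbol
    pairs = Counter(zip(seq, seq[1:]))     # C_ij: symbol i followed by symbol j
    total = 0.0
    for (a, b), c_ab in pairs.items():
        p_a = single[a] / n
        p_b = single[b] / n
        p_ab = c_ab / (n - 1)
        total += p_ab * math.log2(p_a / p_b) ** 2
    return total

print(fluctuation_complexity("AAAA"))  # 0.0
```

Under these assumptions a sequence whose symbols are all equally frequent, such as a constant or strictly alternating sequence, has zero fluctuation complexity, because every net information gain $\log_2 (P_i / P_j)$ vanishes.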

The numerical feature vector

The whole numerical feature vector (fv) of the protein sequence was constructed by concatenating the features MSR, DSR, DSRT, DFG, DFGT and C, which were calculated by (4)–(9), respectively.

The models of similarity/dissimilarity analysis

Gaussian Kernel similarity reflects the degree to which a tested point belongs to the cluster with a given centroid and adjustable bandwidth. Cosine similarity measures the directional similarity of two vectors. Thus, we adopted these two popular measurements of distance between two vectors to reflect the similarity/dissimilarity of two protein sequences. For two protein sequences P1 and P2, the corresponding feature vectors were $fv_1 = (x_{11}, x_{21}, \dots, x_{n1})$ and $fv_2 = (x_{12}, x_{22}, \dots, x_{n2})$, where $x_{ij}$ denoted the ith feature in the jth protein and n denoted the number of features calculated by (12). The distance based on Gaussian Kernel similarity between $fv_1$ and $fv_2$ was defined from a Gaussian kernel of the two vectors, in which the parameter σ controlled the bandwidth and was set to 4 in this paper. The second distance measurement d(s, t) between two vectors s and t was defined to be one minus the cosine of the included angle between s and t, which was based on the assumption that two protein sequences were similar if the corresponding feature vectors had similar directions, i.e.,

$$d(s, t) = 1 - \frac{s \cdot t}{\lVert s \rVert \, \lVert t \rVert}$$
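
The two distance measurements can be sketched in Python as below. The cosine distance follows the verbal definition (one minus the cosine of the included angle). The Gaussian form used here, $1 - \exp(-\lVert s - t \rVert^2 / (2\sigma^2))$, is an assumption, since the exact expression is not given in this text; the function names are hypothetical.

```python
import math

def gaussian_kernel_distance(s, t, sigma=4.0):
    """1 - exp(-||s - t||^2 / (2 sigma^2)); this exact form is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(s, t))
    return 1.0 - math.exp(-sq / (2.0 * sigma ** 2))

def cosine_distance(s, t):
    """One minus the cosine of the angle between s and t."""
    dot = sum(a * b for a, b in zip(s, t))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_t = math.sqrt(sum(b * b for b in t))
    return 1.0 - dot / (norm_s * norm_t)

print(gaussian_kernel_distance([1.0, 2.0], [1.0, 2.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))           # 1.0
```

Both functions return 0 for identical vectors. The cosine distance ignores vector length and depends only on direction, whereas the Gaussian form depends on the Euclidean distance relative to the bandwidth σ.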

Materials

Four data sets were curated to evaluate the proposed method. The first data set consists of nine ND5 proteins, i.e., the NADH5 of nine species including Human, Gorilla, Pygmy Chimpanzee, Common Chimpanzee, Fin Whale, Blue Whale, Rat, Mouse and Opossum, from NCBI (Xu et al., 2016). The accession numbers are listed in Table S1 of supplementary materials. The second data set consists of thirty-five Coronavirus Spike Proteins derived from the NCBI. They are from species of order Nidovirales, family Coronaviridae and subfamily Coronavirinae. The information and accession numbers (Mu et al., 2016) are listed in Table S2 in supplementary materials. The third data set consists of twenty-four previously published transferrin proteins from fish, amphibians and mammals of twenty-four vertebrates, whose taxonomic information and accession numbers (Xu et al., 2016) are provided in Table S3 in supplementary materials. Furthermore, twenty-seven antifreeze protein sequences (AFPs) from spruce budworm (Choristoneura fumiferana, CF), yellow mealworm (Tenebrio molitor, TM), Hypogastrura harveyi (HH), Dorcus curvidens binodulosus (DCB), Microdera dzhungarica punctipennis (MDP) and Dendroides canadensis (DC) (Zhang, 2010) were analyzed.

Results and discussion

Similarity analysis of nine ND5

Mitochondrial NADH dehydrogenase subunit 5 (ND5) is widely used in the analysis of phylogeny and population genetic diversity because of its high mutation rate. To illustrate the proposed similarity/dissimilarity models, the similarities of nine ND5 protein sequences across nine species were analyzed. We calculated the distances by Gaussian Kernel similarity and Cosine similarity (Tables 2 and 3, respectively), which were then compared to the analysis achieved by the ClustalW (Table 4) in order to validate the effectiveness of the method. The corresponding phylogenetic trees (Fig. 2) showed that Gaussian Kernel similarity analysis is consistent with that of the ClustalW, while Cosine similarity analysis is closely consistent with that of the ClustalW.

Table 2

The distance matrix of the nine ND5 protein sequences calculated by Gaussian Kernel similarity analysis.

	Human	Gorilla	C.Chim.	P.Chim.	Rat	Mouse	Opossum	F.Whale	B.Whale
Human	0	0.48	0.35	0.36	0.87	0.84	0.89	0.75	0.77
Gorilla		0	0.48	0.46	0.86	0.81	0.89	0.77	0.81
C.Chim.			0	0.21	0.85	0.79	0.88	0.73	0.73
P.Chim.				0	0.83	0.76	0.85	0.72	0.72
Rat					0	0.64	0.82	0.85	0.85
Mouse						0	0.78	0.77	0.77
Opossum							0	0.84	0.83
F.Whale								0	0.20
B.Whale									0

Table 3

The distance matrix of the nine ND5 protein sequences calculated by Cosine similarity analysis.

	Human	Gorilla	C.Chim.	P.Chim.	Rat	Mouse	Opossum	F.Whale	B.Whale
Human	0	0.18	0.13	0.13	0.57	0.54	0.53	0.38	0.42
Gorilla		0	0.18	0.17	0.53	0.48	0.52	0.39	0.43
C.Chim.			0	0.72	0.56	0.47	0.54	0.37	0.37
P.Chim.				0	0.53	0.46	0.49	0.38	0.37
Rat					0	0.30	0.41	0.53	0.53
Mouse						0	0.38	0.43	0.42
Opossum							0	0.44	0.43
F.Whale								0	0.59
B.Whale									0

Table 4

The distance matrix of the nine ND5 protein sequences calculated by the ClustalW.

	Human	Gorilla	C.Chim.	P.Chim.	Rat	Mouse	Opossum	F.Whale	B.Whale
Human	0	0.104	0.067	0.069	0.456	0.443	0.464	0.375	0.377
Gorilla		0	0.096	0.093	0.469	0.453	0.494	0.39	0.387
C.Chim.			0	0.048	0.461	0.448	0.472	0.37	0.37
P.Chim.				0	0.453	0.443	0.459	0.368	0.368
Rat					0	0.241	0.494	0.41	0.407
Mouse						0	0.469	0.422	0.415
Opossum							0	0.486	0.486
F.Whale								0	0.034
B.Whale									0

Fig. 2

Phylogenetic trees of the nine ND5 constructed by Gaussian Kernel similarity (a), Cosine similarity (b) and the ClustalW (c).

In addition, the correlation coefficients between the distance matrices calculated by Gaussian Kernel similarity and by the ClustalW were calculated. Pearson’s correlation coefficient of X and Y is

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{13}$$

where Cov denotes the covariance, $\sigma_X$ denotes the standard deviation of X and $\sigma_Y$ denotes the standard deviation of Y. For example, X represents the distances between Human and the nine species (listed in the first row of Table 2), which were calculated by Gaussian Kernel similarity. Y represents the distances between Human and the nine species (listed in the first row of Table 4), which were calculated by the ClustalW. The correlation coefficient of X and Y was calculated by (13), which is 0.96 as shown in the first column and first row of Table 5. With the same method, the correlation coefficients for the other species were calculated (first column in Table 5). Similarly, the correlation coefficients between the result by the ClustalW and the results by some state-of-the-art methods were calculated (Table 5). The results were also presented in graphical format (Fig. 3), which showed that the result by the proposed method has relatively higher correlation with that by the ClustalW than other methods. This observation further confirmed the effectiveness of the proposed method.
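
The Human example can be checked with a few lines of Python. The sketch below computes Pearson's correlation coefficient for the first rows of Table 2 and Table 4; the function name `pearson` and the variable names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The common factor 1/n in the covariance and variances cancels out.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Human rows of Table 2 (Gaussian Kernel similarity) and Table 4 (ClustalW).
human_gaussian = [0, 0.48, 0.35, 0.36, 0.87, 0.84, 0.89, 0.75, 0.77]
human_clustalw = [0, 0.104, 0.067, 0.069, 0.456, 0.443, 0.464, 0.375, 0.377]
print(round(pearson(human_gaussian, human_clustalw), 2))  # 0.96
```

The value agrees with the Human entry in the first column of Table 5.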

Table 5

The correlation coefficients for nine ND5 proteins of Gaussian Kernel similarity method and some state-of-the-art methods as compared with the ClustalW method.

Species	Our Method	Yao et al. (2014c)	Ellakkani and Mahran (2015)	Zhang et al. (2015)	Mu et al. (2016)	Liu et al. (2013)	Wu et al. (2010)	Huang and Hu (2013)	Yao et al. (2014b)
Human	0.96	0.93	−0.09	0.91	0.93	0.94	0.93	0.89	0.89
Gorilla	0.93	0.88	−0.03	0.92	0.93	0.93	0.91	0.93	0.85
C.Chim.	0.96	0.94	−0.11	0.93	0.91	0.94	0.91	0.95	0.86
P.Chim.	0.96	0.95	−0.11	0.91	0.93	0.93	0.76	0.91	0.77
Rat	0.96	0.95	0.72	0.92	0.93	0.84	0.63	0.93	0.87
Mouse	0.96	0.98	0.75	0.87	0.97	1.00	0.66	0.86	0.76
Opossum	0.99	0.94	0.99	0.99	0.93	0.89	0.52	0.92	0.93
F.Whale	0.98	0.91	0.16	0.92	0.93	0.89	0.53	0.92	0.87
B.Whale	0.98	0.93	0.15	0.92	0.96	0.87	0.69	0.93	0.90

Fig. 3

The correlation coefficients for nine ND5 proteins of Gaussian Kernel similarity method and some state-of-the-art methods as compared with the ClustalW method.

Similarity analysis of thirty-five Coronavirus Spike Proteins

Coronaviruses are viruses associated with respiratory, intestinal, liver, and neurological diseases. Generally, coronaviruses were divided into three groups. The first group and the second group come from mammals, and the third group comes from poultry (chicken and turkey). To classify the SARS-CoV viruses and associate proteins with the virus virulence, the proposed method was utilized to analyze the coronavirus spike proteins. Phylogenetic trees were built by Gaussian Kernel similarity, Cosine similarity and the ClustalW (Fig. 4a, b and Fig. 5). All the SARS-CoVs clustered in a new strain nearest to group II coronaviruses. This is consistent with the report that SARS-CoV represents a lineage that split off from the group II branch relatively late in coronavirus evolution (Snijder et al., 2003). All the groups of CoVs were also separated correctly by the analysis (Fig. 4). The results from Gaussian Kernel similarity and Cosine similarity analysis were comparable, with only minor differences in SARS-CoVs classification. However, SARS-CoVs were not clearly distinguished by the ClustalW (Fig. 5). In conclusion, the results of the Gaussian Kernel similarity and Cosine similarity analysis methods were in agreement with the ClustalW, with improvement.

Fig. 4

Phylogenetic trees of thirty-five CoVPs constructed by Gaussian Kernel similarity (a) and Cosine similarity (b).

Fig. 5

Phylogenetic tree of thirty-five CoVPs achieved by the ClustalW.

The similarity analysis of twenty-four transferrin proteins

Iron is essential for various metabolic processes such as oxygen transfer, electron transport, DNA synthesis, etc. Transferrin (TF) is the major iron-transporting protein in the plasma. Lactoferrin (LF) is an iron-binding glycoprotein of the transferrin family (García-Montoya et al., 2012). Previous studies have demonstrated the phylogenetic relation between LFs and TFs (Chang, Wang, 2011, Ford, 2001, Yu, Zhang, Gutman, Shi, Dehmer, 2017). In this study, twenty-four previously published TFs were studied by Gaussian Kernel similarity and Cosine similarity analysis, and the corresponding phylogenetic trees were built (Fig. 6a and b). TF proteins and LF proteins were clustered into their corresponding branches. LF proteins were clustered into one branch and they were close to TFs of mammals. The TFs from mammals and salmonids clustered into their corresponding branches, respectively. Our analysis displayed no misplaced or misclassified species. However, the analysis performed by the ClustalW (Fig. 7) could not distinguish LFs from TFs. Thus, our method by Gaussian Kernel similarity and Cosine similarity analysis outperformed the multiple alignment method by the ClustalW.

Fig. 6

Phylogenetic trees of twenty-four TFs constructed by Gaussian Kernel similarity (a) and Cosine similarity (b).

Fig. 7

Phylogenetic tree of twenty-four TFs by the ClustalW.

Similarity analysis of 27 AF proteins

Antifreeze proteins (AFPs) play a vital role in the antifreeze effect of overwintering organisms. They have a wide range of applications in numerous fields, such as improving crop production and the quality of frozen foods. Here we generated phylogenetic trees of the twenty-seven AFPs by the Gaussian Kernel similarity and Cosine similarity analysis methods (Fig. 8) and by the ClustalW analysis (Fig. 9). Gaussian Kernel similarity and Cosine similarity analysis accurately classified all species (Fig. 8), outperforming the ClustalW analysis, which divided the TM group into three groups (Fig. 9).

Fig. 8

Phylogenetic trees of twenty-seven AFPs constructed by Gaussian Kernel similarity (a) and Cosine similarity (b).

Fig. 9

Phylogenetic tree of twenty-seven AFPs by the ClustalW.

Conclusions

In this study, we presented for the first time the calculation of the spectral radii of 20 AAs, and a novel 2-D graphical representation using these spectral radii. This method offers the advantage of easy visibility and inspection of similarity/dissimilarity between proteins, which lays the ground for numerical characterizations of proteins. Furthermore, it avoids loss of information and ensures the integrity of the information. The proposed graphical representation method satisfies all requirements of a useful graphical representation proposed in Randić et al. (2010). To characterize the protein sequence numerically, MSR, DSR, DSRT, DFG, DFGT and C of the sequence were extracted as the numerical features. The MSR, DSR and DFG features confirmed the integrity of information at different levels, while the DSRT and DFGT reflected dynamics features of protein sequences. In addition, fluctuation complexity reflected the non-linear feature of the protein sequence. These features are based on the distributions of spectral radii and functional groups, and the fluctuation in net information gain of the sequence. Finally, we employed Gaussian Kernel similarity and Cosine similarity analysis to measure the similarity of protein sequences using the feature vector. The method was applied to the similarity analysis of protein sequences of four data sets: nine ND5, thirty-five CoVPs, twenty-four TFs and twenty-seven AFPs. As the features reflected the protein sequence effectively, both the Gaussian Kernel similarity and Cosine similarity models obtained satisfactory results. Results for nine ND5, thirty-five CoVPs, twenty-four TFs and twenty-seven AFPs were consistent with the ClustalW method, with further improvements.
When compared to other methods, the analysis presented in this study achieved higher correlation coefficients with the ClustalW for nine ND5, which confirmed the efficiency of the proposed approach. The simulations of nine ND5 and thirty-five CoVPs indicated that the results of Gaussian Kernel similarity were to a certain extent more sensitive than those achieved by Cosine similarity analysis. In summary, we demonstrated that the proposed features and similarity models measured protein sequences efficiently. The obtained analysis was consistent with previously demonstrated evolution patterns. The proposed approach may therefore be applied in identification and classification of unknown species by protein sequences, as well as tracking the source of virus and designing drugs for disease therapy.

27 in total

1. Molecular evolution of transferrin: evidence for positive selection in salmonids.

Authors: M J Ford
Journal: Mol Biol Evol Date: 2001-04 Impact factor: 16.240

2. 20D-dynamic representation of protein sequences.

Authors: Agata Czerniecka; Dorota Bielińska-Wąż; Piotr Wąż; Tim Clark
Journal: Genomics Date: 2015-12-17 Impact factor: 5.736

Review 3. Using bioinformatics for drug target identification from the genome.

Authors: Zhenran Jiang; Yanhong Zhou
Journal: Am J Pharmacogenomics Date: 2005

4. WRF-TMH: predicting transmembrane helix by fusing composition index and physicochemical properties of amino acids.

Authors: Maqsood Hayat; Asifullah Khan
Journal: Amino Acids Date: 2013-03-14 Impact factor: 3.520

5. Primary structure similarity analysis of proteins sequences by a new graphical representation.

Authors: S C Xu; Z Li; S P Zhang; J L Hu
Journal: SAR QSAR Environ Res Date: 2014-09-22 Impact factor: 3.000

6. Protein sequence analysis by incorporating modified chaos game and physicochemical properties into Chou's general pseudo amino acid composition.

Authors: Chunrui Xu; Dandan Sun; Shenghui Liu; Yusen Zhang
Journal: J Theor Biol Date: 2016-06-29 Impact factor: 2.691

7. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation.

Authors: A El-Lakkani; H Mahran
Journal: SAR QSAR Environ Res Date: 2015 Impact factor: 3.000

8. Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix.

Authors: Lulu Yu; Yusen Zhang; Ivan Gutman; Yongtang Shi; Matthias Dehmer
Journal: Sci Rep Date: 2017-04-10 Impact factor: 4.379

Review 9. Lactoferrin a multiple bioactive protein: an overview.

Authors: Isui Abril García-Montoya; Tania Siqueiros Cendón; Sigifredo Arévalo-Gallegos; Quintín Rascón-Cruz
Journal: Biochim Biophys Acta Date: 2011-06-25

10. Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage.

Authors: Eric J Snijder; Peter J Bredenbeek; Jessika C Dobbe; Volker Thiel; John Ziebuhr; Leo L M Poon; Yi Guan; Mikhail Rozanov; Willy J M Spaan; Alexander E Gorbalenya
Journal: J Mol Biol Date: 2003-08-29 Impact factor: 5.469