
Similarity/Dissimilarity analysis of protein sequences based on a new spectrum-like graphical representation.

Yuhua Yao¹, Shoujiang Yan¹, Huimin Xu¹, Jianning Han¹, Xuying Nan¹, Ping-An He¹, Qi Dai¹.

Abstract

Sequence comparison is one of the foundations of bioinformatics and can be used to study evolutionary relationships among sequences. In this study, a 2D spectrum-like graphical representation of protein sequences is presented, based on the hydrophobicity scale of amino acids. The frequencies of the amplitudes of 4-subsequences are adopted to characterize a spectrum-like graph, and a 17D vector is used as the descriptor of a protein sequence. A χ² compatibility test is also performed. The new similarity analysis approach is illustrated on all protein sequences encoded by the mitochondrial genomes of 20 different species. Finally, a comparison with the ClustalW method shows the utility of our method.


Keywords: compatibility test; protein sequences; similarities/dissimilarities; spectral representation

Year: 2014 PMID: 25002811 PMCID: PMC4068907 DOI: 10.4137/EBO.S14713

Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625

Introduction

Comparison of bio-sequences, such as DNA, RNA, and protein sequences, is the origin of bioinformatics. Through such comparison, we can identify the similarity/dissimilarity of sequences from different species. Many methods have been introduced, such as graphical representations of DNA/RNA. Based on graphical representations, numerical characterization techniques offer a route toward quantitatively estimating the similarities/dissimilarities of sequences.1–13 Graphical representations of proteins emerged later because of the increased complexity of biological strings built on a 20-letter alphabet (representing the 20 natural amino acids) in comparison with strings built from only four letters (representing DNA or RNA). On the basis of the genetic code, Randić et al. and Bai and Wang14–17 gave some graphical representations and sequence descriptors of proteins. By analogy with existing graphical representations of DNA, and in order to better compare the similarities/dissimilarities of proteins, we modified some graphical representations of proteins.18–21 Graphical representations of protein sequences based on physicochemical properties of the 20 amino acids have also been introduced.22–26 Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Such a measure can lead to a better understanding of the nature of these sequences.
The most important known challenges presented by these data, which are only partially addressed by existing methods, are the following: (1) it is difficult to extract the information underlying the chronological dependencies of structural features, which may have significant meaning; (2) the computational cost involved is high; (3) these issues create ambiguities and complications for the similarity measurement task, especially for sequences of significantly different lengths. In this study, we outline a novel spectrum-like graphical representation based on the hydrophobicity of amino acids, and introduce a novel strategy for sequence comparison based on the frequencies of all amplitudes in the spectral graphs of different species. We compare all protein sequences in the mitochondria of 20 species.

Methods

Here we consider a physicochemical property that is closely related to protein structure: the hydrophobicity of amino acids. The distribution of hydrophobic amino acids in the primary sequence can be used as an indicator to predict the secondary structure elements of a protein.27 In what follows, we construct the spectrum-like graphical representation of protein sequences. First, each amino acid is characterized by its physicochemical properties. The twenty amino acids are simplified into two types28: hydrophobic amino acids H = {F, L, I, Y, M, W, V, A, P, C} and hydrophilic amino acids P = {S, N, K, D, R, T, H, Q, E, G}. The twenty amino acids are then further simplified into four types29: strong hydrophobic amino acids SH = {F, L, I, Y, W}; weak hydrophobic amino acids WH = {M, V, A, P, C}; strong hydrophilic amino acids SP = {S, N, K, D, R}; weak hydrophilic amino acids WP = {T, H, Q, E, G}. Thus, given a protein sequence S = s_1 s_2 … s_N with N amino acids, we inspect it by stepping one amino acid at a time. At step i (i = 1, 2, …, N), s_i is transformed into d_i, which may be 2, 1, −1, or −2. The digit sequence D = d_1 d_2 … d_N is then obtained. In order to display the differences between hydrophobic and hydrophilic amino acids more clearly, during the construction of the digit sequence we preset the values of the properties (these are the values used in the d_i column of Table 1):

$$ d_i = \begin{cases} 2, & s_i \in SH \\ 1, & s_i \in WH \\ -1, & s_i \in WP \\ -2, & s_i \in SP \end{cases} $$

It is sometimes instructive to represent a random walk as a polygonal line, or path, in the plane, where the horizontal axis represents time and the vertical axis represents the value of {S_n}. Given a sequence {S_n} of partial sums, we first plot the points (n, S_n), and then for each k we connect (k, S_k) and (k+1, S_{k+1}) with a straight line segment. The length of a path is just the difference in the time values of the beginning and ending points on the path.
So the four consecutive numbers d_i, d_{i+1}, d_{i+2}, d_{i+3} are summed as partial sums; the summations are the values on the vertical axis and are considered the amplitudes. With i as the value on the horizontal axis, running from 1 to N−3, we have the points P_1(x_1, y_1), P_2(x_2, y_2), …, P_{N−3}(x_{N−3}, y_{N−3}), where x_i and y_i are calculated by the following formula:

$$ x_i = i, \qquad y_i = d_i + d_{i+1} + d_{i+2} + d_{i+3}, \qquad i = 1, 2, \ldots, N-3 $$

Connecting adjacent points, we obtain a spectrum-like graph of the protein sequence. We illustrate the approach on two short protein segments from the yeast Saccharomyces cerevisiae. Figure 1 shows the two spectral graphs, and the corresponding proteins are

Figure 1

The spectrum-like graphs of two protein fragments I and II of yeast Saccharomyces cerevisiae, having 30 amino acids.

Protein I: WTFESRNDPAKDPVILWLNGGPGCSSLTGL; Protein II: WFFESRNDPANDPIILWLNGGPGCSSFTGL. The digit sequence (d) and 4-subsequence (y) of protein I are shown in Table 1.

Table 1

The digit sequence (d) and 4-subsequence (y) of the protein I.

i	seq	d_i	y_i	i	seq	d_i	y_i
1	W	2	2	16	L	2	4
2	T	−1	−2	17	W	2	1
3	F	2	−3	18	L	2	−2
4	E	−1	−7	19	N	−2	−3
5	S	−2	−8	20	G	−1	−2
6	R	−2	−5	21	G	−1	0
7	N	−2	−2	22	P	1	−1
8	D	−2	−2	23	G	−1	−4
9	P	1	−2	24	C	1	−1
10	A	1	−2	25	S	−2	−3
11	K	−2	−2	26	S	−2	−2
12	D	−2	2	27	F	2	2
13	P	1	6	28	T	−1	−
14	V	1	7	29	G	−1	−
15	I	2	8	30	L	2	−
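The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `digit_sequence` and `amplitudes` are hypothetical (they do not come from the paper), and the class-to-digit mapping is the one read off the d_i column of Table 1.

```python
# Four-class hydrophobicity encoding; the digit values follow the d_i column of Table 1.
CLASS_VALUES = {
    "FLIYW": 2,    # strong hydrophobic (SH)
    "MVAPC": 1,    # weak hydrophobic (WH)
    "THQEG": -1,   # weak hydrophilic (WP)
    "SNKDR": -2,   # strong hydrophilic (SP)
}
DIGIT = {aa: value for group, value in CLASS_VALUES.items() for aa in group}


def digit_sequence(protein):
    """Map each residue s_i to its digit d_i."""
    return [DIGIT[aa] for aa in protein]


def amplitudes(protein, window=4):
    """Return y_i = d_i + d_{i+1} + d_{i+2} + d_{i+3} for i = 1 .. N-3."""
    d = digit_sequence(protein)
    return [sum(d[i:i + window]) for i in range(len(d) - window + 1)]


protein_i = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
y = amplitudes(protein_i)
print(y[:5])   # [2, -2, -3, -7, -8], the first five y_i values in Table 1
print(len(y))  # 27, i.e. N - 3 for N = 30
```

Plotting the points (i, y_i) and joining neighbours with straight segments gives the spectrum-like graph of the fragment.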

Observing Figure 1, we see that the two curves are similar overall and share several identical local segments. In this method, we emphasize amino acids of the same hydrophilic/hydrophobic class because they are more likely to form similar or identical structures. In Figure 2, we apply the new spectral representation to the ND6 (NADH dehydrogenase subunit 6) proteins of nine species: human, gorilla, common chimpanzee, pygmy chimpanzee, blue whale, fin whale, rat, mouse, and opossum. Comparing the curves in Figure 2 closely, we find that the curves of the ND6 proteins of human, gorilla, P. chimpanzee, and C. chimpanzee are the most similar to one another. The ND6 protein graphs are also similar within the pair F. whale and B. whale and within the pair rat and mouse. In addition, the ND6 protein of opossum is clearly different from those of the other species. These similarities/dissimilarities are consistent with known evolutionary relationships.

Figure 2

The spectrum-like graphs of the ND6 proteins of nine mammalian species: human, gorilla, common chimpanzee, pygmy chimpanzee, blue whale, fin whale, rat, mouse, and opossum.

Unexpectedly, we find that most amplitudes are greater than 0, which may mean that, under the four-class classification of amino acids, protein sequences tend to favor hydrophobic amino acids. This is probably because hydrophobic amino acids have an important influence on protein structure.

Results/Discussion

Once a matrix is available to represent a sequence, numerous matrix invariants25,26,30–33 can be used as descriptors of the sequence. However, the computational complexity of these matrix-invariant techniques is at least O(N²), which is the main difficulty in computation. In this section, we overcome this difficulty and introduce a novel way to numerically characterize a protein sequence. Its computational complexity is reduced to O(N), so it is easy to implement. In addition, the new sequence descriptor scales linearly with the length of the sequences, so it is appropriate for sequences of significantly different lengths. When we construct the spectrum-like graph, we calculate the sum of four consecutive numbers of the digit sequence. These sums are the amplitudes, which can be −8, −7, −6, −5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5, 6, 7, and 8. To obtain a numerical representation of a protein sequence, we calculate the frequency of each amplitude. A protein sequence can therefore be characterized by a 17D vector. The data set consists of 13 proteins (cytochrome oxidase subunits I, II, and III; cytochrome b apoenzyme; NADH dehydrogenase subunits 1–6 and 4L; ATP synthase subunits 6 and 8) encoded by the typical mitochondrial genome of mammalian species. Information on the 13 proteins is listed in Table 2. The 13 proteins are concatenated into one long amino acid sequence and analyzed as one protein sequence. Their amplitude frequencies are listed in Table 3. From the results in Table 3, we construct the 17-component vectors of the spectral graphs corresponding to the proteins of the 9 species, and the 17-component vectors are first normalized. For a vector X, normalization means Z = (X − Mean(X))/Std(X), where Mean(X) is the mean of X and Std(X) is the standard deviation of X.
Table 4 gives the similarity/dissimilarity matrix for the protein sequences of the nine species, based on the Euclidean distances between the normalized 17-component vectors. For two arbitrary sequences S1 and S2, the Euclidean distance D(S1, S2) between the two vectors is

$$ D(S_1, S_2) = \sqrt{\sum_{k=1}^{17} \left( z^{(1)}_k - z^{(2)}_k \right)^2} $$

where z^{(j)}_k denotes the k-th component of the normalized vector of S_j.

Table 2

Information (accession number and length) for all protein sequences in the mitochondria of 9 species.

	HUMAN	GORILLA	P. CHIMP	C. CHIMP	F. WHALE	B. WHALE	RAT	MOUSE	OPOSSUM
ND1	CAA24026 (318)	BAA85277 (318)	BAA85294 (318)	BAA85268 (318)	CAA43444 (318)	CAA50995 (318)	CAA32954 (318)	CAA24080 (315)	CAA82677 (318)
ND2	CAA24027 (347)	BAA85278 (347)	BAA85295 (347)	BAA85269 (347)	CAA43445 (347)	CAA50996 (347)	CAA32955 (345)	CAA24081 (345)	CAA82678 (347)
COI	CAA24028 (513)	BAA85279 (513)	BAA85296 (513)	BAA85270 (513)	CAA43451 (516)	CAA50997 (516)	CAA32956 (514)	CAA24082 (514)	CAA82679 (513)
COII	CAA24029 (227)	BAA07303 (227)	BAA07312 (227)	BAA07299 (227)	CAA43452 (227)	CAA50998 (227)	CAA32957 (227)	CAA24083 (227)	CAA82680 (235)
ATP8	CAA24030 (68)	BAA07304 (68)	BAA07313 (68)	BAA07300 (68)	CAA43441 (63)	CAA50999 (63)	CAA32958 (67)	CAA24084 (67)	CAA82681 (69)
ATP6	CAA24031 (226)	BAA85280 (226)	BAA85297 (226)	BAA85271 (226)	CAA43442 (226)	CAA51000 (226)	CAA32959 (226)	CAA24085 (226)	CAA82682 (226)
COIII	CAA24032 (261)	BAA85281 (261)	BAA85298 (261)	BAA85272 (261)	CAA43453 (261)	CAA51001 (261)	CAA32960 (261)	CAA24090 (278)	CAA82683 (281)
ND3	CAA24033 (115)	BAA85282 (115)	BAA85299 (115)	BAA85273 (115)	CAA43446 (115)	CAA51002 (115)	CAA32961 (115)	CAA24086 (114)	CAA82684 (116)
ND4L	CAA24034 (98)	BAA07305 (98)	BAA07314 (98)	BAA07301 (98)	CAA43447 (98)	CAA51003 (98)	CAA32962 (98)	CAA24087 (97)	CAA82685 (98)
ND4	CAA24035 (459)	BAA85283 (459)	BAA85300 (459)	BAA85274 (459)	CAA43448 (459)	CAA51004 (459)	CAA32963 (459)	CAA24091 (474)	CAA82686 (474)
ND5	CAA24036 (603)	BAA07306 (603)	BAA07315 (603)	BAA07302 (603)	CAA43449 (606)	CAA51005 (606)	CAA32964 (610)	CAA24088 (607)	CAA82687 (602)
ND6	CAA24037 (174)	BAA07307 (174)	BAA85301 (174)	BAA85275 (174)	CAA43450 (175)	CAA51006 (175)	CAA32965 (172)	CAA24089 (172)	CAA82688 (168)
CYTB	CAA24038 (380)	BAA85284 (380)	BAA85302 (380)	BAA85276 (380)	CAA43443 (379)	CAA51007 (379)	CAA32966 (380)	CAA24092 (392)	CAA82689 (382)
Total length	3789	3789	3789	3789	3790	3790	3792	3728	3729

Table 3

The frequencies of amplitudes of spectral graphs for all proteins sequences in the mitochondrion of 9 different species.

f (y_i)	HUMAN	GORILLA	P. CHIMP	C. CHIMP	F. WHALE	B. WHALE	RAT	MOUSE	OPOSSUM
f (−8)	0.0018	0.0018	0.0021	0.0021	0.0008	0.0011	0.0018	0.0032	0.0013
f (−7)	0.0040	0.0058	0.0040	0.0037	0.0063	0.0063	0.0071	0.0048	0.0084
f (−6)	0.0114	0.0100	0.0129	0.0129	0.0111	0.0106	0.0108	0.0100	0.0118
f (−5)	0.0145	0.0143	0.0148	0.0148	0.0177	0.0158	0.0161	0.0177	0.0154
f (−4)	0.0341	0.0365	0.0359	0.0351	0.0372	0.0378	0.0430	0.0412	0.0363
f (−3)	0.0520	0.0541	0.0534	0.0541	0.0494	0.0470	0.0546	0.0508	0.0523
f (−2)	0.0634	0.0565	0.0576	0.0571	0.0562	0.0576	0.0504	0.0526	0.0476
f (−1)	0.0726	0.0695	0.0716	0.0716	0.0673	0.0689	0.0697	0.0717	0.0753
f (0)	0.1062	0.1096	0.1091	0.1094	0.1128	0.1122	0.1124	0.1124	0.1163
f (1)	0.1220	0.1249	0.1236	0.1223	0.1180	0.1215	0.1196	0.1145	0.1147
f (2)	0.0914	0.0890	0.0919	0.0909	0.0877	0.0866	0.0823	0.0814	0.0826
f (3)	0.1112	0.1059	0.1072	0.1091	0.1109	0.1106	0.1174	0.1174	0.1181
f (4)	0.1233	0.1244	0.1244	0.1263	0.1252	0.1231	0.1306	0.1293	0.1283
f (5)	0.0713	0.0737	0.0737	0.0737	0.0700	0.0721	0.0678	0.0701	0.0646
f (6)	0.0465	0.0465	0.0444	0.0433	0.0544	0.0560	0.0454	0.0500	0.0481
f (7)	0.0523	0.0534	0.0515	0.0515	0.0523	0.0523	0.0483	0.0497	0.0562
f (8)	0.0219	0.0240	0.0219	0.0222	0.0227	0.0206	0.0224	0.0233	0.0227
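The 17D descriptor behind Table 3 is simply the relative frequency of each amplitude value. A minimal sketch follows, using the y_i values of protein I from Table 1 as input; the name `frequency_vector` is hypothetical, and the real analysis applies the same counting to the concatenated 13-protein sequence of each species.

```python
from collections import Counter

AMPLITUDES = list(range(-8, 9))  # the 17 possible sums of four digits from {-2, -1, 1, 2}


def frequency_vector(y):
    """Return the 17-component vector of relative amplitude frequencies, ordered from -8 to 8."""
    counts = Counter(y)
    total = len(y)
    return [counts[a] / total for a in AMPLITUDES]


# y_i values of protein I, copied from Table 1 (i = 1 .. 27).
y_protein_i = [2, -2, -3, -7, -8, -5, -2, -2, -2, -2, -2, 2, 6, 7, 8,
               4, 1, -2, -3, -2, 0, -1, -4, -1, -3, -2, 2]

f = frequency_vector(y_protein_i)
print(len(f))                    # 17
print(f[AMPLITUDES.index(-2)])   # 9 of the 27 amplitudes equal -2
```

Because each sequence is scanned once and the vector length is fixed at 17, sequences of very different lengths end up with descriptors of the same size.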

Table 4

The similarity matrix of 9 species based on the frequencies of amplitudes.

SPECIES	GORILLA	P. CHIMPAN.	C. CHIMPAN.	F. WHALE	B. WHALE	RAT	MOUSE	OPOSSUM
Human	4.2144	2.7639	3.0017	5.5206	5.1463	6.9385	7.1704	7.4932
Gorilla		4.0165	4.0790	5.1489	5.7607	6.1994	6.9165	7.2921
P. Chimpan.			1.0975	5.6356	5.5562	6.4450	7.1025	7.5040
C. Chimpan.				5.6890	5.9505	6.0357	6.7764	7.1315
F. Whale					3.2861	5.5199	5.5947	6.0795
B. Whale						6.5392	6.6137	7.0378
Rat							4.0634	5.6101
Mouse								6.1929
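The normalization and distance steps that lead to a matrix like Table 4 can be sketched as follows. Two points are assumptions rather than statements from the paper: the sample standard deviation (n − 1 denominator) is used for Std(X), and the text does not say whether the z-score is taken within each species vector or across species for each amplitude, so `zscore` is written for a single vector and left to the caller to apply. All function names are hypothetical.

```python
import math
import statistics


def zscore(x):
    """Z = (X - Mean(X)) / Std(X); using the sample standard deviation is an assumption."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)
    return [(v - mean) / std for v in x]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Upper-triangular pairwise distances, keyed by (name_a, name_b)."""
    names = list(vectors)
    return {
        (a, b): euclidean(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


print(zscore([1.0, 2.0, 3.0]))               # [-1.0, 0.0, 1.0]
print(euclidean([0.0, 0.0], [3.0, 4.0]))     # 5.0
```

Passing a dictionary of normalized 17-component vectors to `distance_matrix` yields one entry per species pair, in the same upper-triangular layout as Table 4.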

The analysis of similarities/dissimilarities rests on the following assumption: the smaller the distance (the index of similarity/dissimilarity) between two proteins, the more similar the two proteins are. The indexes of similarity/dissimilarity between the nine species are listed in Table 4. In Table 4, the smaller entries are associated with pairs within the groups (human, gorilla, P. chimpanzee, C. chimpanzee), (F. whale, B. whale), and (rat, mouse). On the other hand, the larger entries in the similarity/dissimilarity matrix are those belonging to opossum. These results are consistent with known evolutionary relationships.12,25 We also calculate the theoretical frequencies of the amplitudes, which are listed in Table 5. As the theoretical values are symmetric about 0, we show only one half. We want to know whether the amplitude frequencies for the 13 proteins in the 9 species are consistent with the theoretical ratios. Figure 6 compares the amplitude distributions of the 13 human proteins with the theoretical values. Then, we calculate the χ² values:

$$ \chi^2 = \sum_{y=-8}^{8} \frac{(O_y - E_y)^2}{E_y} $$

where O_y is the observed count of amplitude y in a protein and E_y is the count expected from the theoretical frequency.

Table 5

The theory values of frequency of the amplitudes.

y_i	SPLIT	COMBINATORIAL NUMBER	THE THEORETICAL FREQUENCY
−8	{−2, −2, −2, −2}	C(4,4) = 1	1/256 ≈ 0.00391
−7	{−2, −2, −2, −1}	C(4,3) = 4	4/256 ≈ 0.01563
−6	{−2, −2, −1, −1}	C(4,2) = 6	6/256 ≈ 0.02344
−5	{−2, −2, −2, 1}; {−2, −1, −1, −1}	C(4,3) = 4; C(4,1) = 4	8/256 ≈ 0.03125
−4	{−2, −2, −2, 2}; {−2, −2, −1, 1}; {−1, −1, −1, −1}	C(4,1) = 4; C(4,1)·C(3,1) = 12; C(4,4) = 1	17/256 ≈ 0.06641
−3	{−2, −2, −1, 2}; {−2, −1, −1, 1}	C(4,1)·C(3,1) = 12; C(4,1)·C(3,1) = 12	24/256 ≈ 0.09375
−2	{−2, −2, 1, 1}; {−2, −1, −1, 2}; {−1, −1, −1, 1}	C(4,2) = 6; C(4,1)·C(3,1) = 12; C(4,1) = 4	22/256 ≈ 0.08594
−1	{−2, −2, 1, 2}; {−2, −1, 1, 1}; {−1, −1, −1, 2}	C(4,1)·C(3,1) = 12; C(4,1)·C(3,1) = 12; C(4,1) = 4	28/256 ≈ 0.10938
0	{−2, −2, 2, 2}; {−2, −1, 1, 2}; {−1, −1, 1, 1}	C(4,2) = 6; C(4,1)·C(3,1)·C(2,1) = 24; C(4,2) = 6	36/256 ≈ 0.14063
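The counts in Table 5 can be cross-checked by brute force. The sketch below enumerates all 4⁴ = 256 ordered 4-tuples of digits, which corresponds to assuming that the four classes are equally likely and that positions are independent (the assumption implied by the 256 denominator); variable names are illustrative only.

```python
from collections import Counter
from itertools import product

DIGITS = (-2, -1, 1, 2)

# Count how many ordered 4-tuples of digits produce each amplitude.
counts = Counter(sum(t) for t in product(DIGITS, repeat=4))
total = len(DIGITS) ** 4  # 256 equally likely tuples
theoretical = {y: counts[y] / total for y in range(-8, 9)}

print(counts[-4], counts[0])        # 17 36
print(round(theoretical[-4], 5))    # 0.06641
```

The enumeration also confirms the symmetry used in Table 5: the count for amplitude y equals the count for −y.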

Figure 6

The distributions of the amplitudes of the 13 human proteins and the theoretical values. Proteins include cytochrome oxidase subunits I, II, and III (COI, COII, and COIII); cytochrome b apoenzyme (CYTB); NADH dehydrogenase subunits 1–6 and 4L (ND1, ND2, ND3, ND4, ND5, ND6, and ND4L); ATP synthase subunits 6 and 8 (ATP6 and ATP8).

The χ² values of the 13 proteins for the 9 species are listed in Table 6. Each protein corresponds to one 17-component vector, so the number of degrees of freedom is df = 17 − 1 = 16. The significance level is α = 0.01, for which the critical value is χ²(16) ≈ 32.00. Nearly all χ² values in Table 6 exceed this critical value, so the observed frequencies are not consistent with the theoretical ratios. The amino acid sequence of a protein determines its structure and function, so its patterns are not expected to be random.

Table 6

The χ2 values for 13 proteins of 9 species.

SPECIES	ND1	ND2	COI	COII	ATP8	ATP6	COIII	ND3	ND4L	ND4	ND5	ND6	CYTB
Human	191.16	124.23	209.09	103.63	20.98	179.72	85.36	247.70	73.39	214.32	158.50	80.81	229.47
Gorilla	208.26	130.95	208.19	117.59	30.21	177.31	94.77	197.36	80.11	229.64	220.32	74.96	250.04
P. Chimpan	177.87	107.64	205.79	107.14	17.95	187.76	88.52	177.68	65.03	222.80	189.38	70.46	278.48
C. Chimpan	189.97	124.54	208.68	107.14	17.47	200.75	86.82	173.44	75.55	232.19	176.92	75.50	306.85
F. Whale	170.14	139.65	195.96	45.17	45.68	128.02	80.64	154.67	75.31	333.77	187.34	114.51	331.96
B. Whale	171.69	122.01	204.06	47.08	44.93	117.65	82.78	154.67	87.32	322.66	191.10	93.92	239.08
Rat	190.92	114.59	175.99	31.61	41.38	126.04	101.83	120.74	26.18	268.47	170.42	153.39	312.43
Mouse	183.63	207.92	178.29	30.79	53.50	126.70	80.15	142.66	27.57	234.89	199.80	117.80	301.85
Opossum	232.56	107.19	220.40	64.89	31.87	135.80	116.69	189.43	76.05	201.68	207.93	169.48	212.37
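A compatibility test of this kind can be sketched with the standard Pearson goodness-of-fit statistic. It is an assumption that the values in Table 6 were computed exactly this way (on counts rather than on frequencies), and the function name and the toy input are hypothetical. The constant is the tabulated χ² critical value for 16 degrees of freedom at α = 0.01.

```python
def chi_square(observed_counts, expected_probs):
    """Pearson statistic: sum over categories of (O - E)^2 / E, with E = n * p."""
    n = sum(observed_counts)
    return sum(
        (o - n * p) ** 2 / (n * p)
        for o, p in zip(observed_counts, expected_probs)
    )


CRITICAL_DF16_ALPHA_001 = 32.00  # tabulated chi-square critical value, df = 16, alpha = 0.01

# Toy two-category illustration (not data from the paper).
stat = chi_square([8, 2], [0.5, 0.5])
print(stat)                              # 3.6
print(stat > CRITICAL_DF16_ALPHA_001)    # False
```

For the real test, `observed_counts` would hold the 17 amplitude counts of one protein and `expected_probs` the 17 theoretical frequencies from Table 5; a statistic above the critical value means the amplitude distribution departs from the theoretical ratios.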

First, we compare the helicase protein sequences of 12 baculoviruses: 3 group I alphabaculoviruses (AcMNPV, BmNPV, RoMNPV); 6 group II alphabaculoviruses (HearNPV, HzSNPV, MacoNPVA, MacoNPVB, HaSNPV, AgseNPV); and 3 betabaculoviruses (AdorGV, CpGV, CrleGV). The length and group information of these protein sequences is shown in Table 7. The phylogenetic tree of the 12 helicase protein sequences is given in Figure 3. The resulting similarities/dissimilarities are consistent with the classification of these baculoviruses.34–36

Table 7

Length and group information of the helicase protein sequences of 12 baculoviruses.

GENUS (GROUP)	VIRUS NAME	ABBREVIATION	ACCESSION NO.	LENGTH
Alphabaculovirus (Group I NPVs)	Autographa californica MNPV	AcMNPV	AAA66725	1221
	Bombyx mori NPV	BmNPV	AAC63764	1222
	Rachiplusia ou MNPV	RoMNPV	AAN28013	1221
Alphabaculovirus (Group II NPVs)	Helicoverpa armigera NPV	HearNPV	AEN04007	1253
	Helicoverpa zea SNPV	HzSNPV	AAL56093	1253
	Mamestra configurata NPVA	MacoNPVA	AAM09201	1212
	Mamestra configurata NPVB	MacoNPVB	AAM95079	1209
	Helicoverpa armigera SNPV	HaSNPV	AAG53827	1253
	Agrotis segetum NPV	AgseNPV	AAZ38246	1213
Betabaculovirus (GVs)	Adoxophyes orana GV	AdorGV	AAP85713	1138
	Cydia pomonella GV	CpGV	AAK70750	1131
	Cryptophlebia leucotreta GV	CrleGV	AAQ21676	1128

Figure 3

The phylogenetic tree based on protein sequences of 12 baculoviruses. Sequences include those for AcMNPV, BmNPV, RoMNPV, HearNPV, HzSNPV, MacoNPVA, MacoNPVB, HaSNPV, AgseNPV, AdorGV, CpGV, and CrleGV.

To further verify the validity of our approach, we performed an experiment on a dataset of the 13 proteins encoded by the same strand of the mitochondrial genome from 20 mammalian species: human (Homo sapiens), C. chimpanzee (Pan troglodytes), P. chimpanzee (Pan paniscus), gorilla (Gorilla gorilla), orangutan (Pongo pygmaeus), gibbon (Hylobates lar), baboon (Papio hamadryas), horse (Equus caballus), white rhinoceros (Ceratotherium simum), harbor seal (Phoca vitulina), gray seal (Halichoerus grypus), cat (Felis catus), F. whale (Balaenoptera physalus), B. whale (Balaenoptera musculus), cow (Bos taurus), rat (Rattus norvegicus), mouse (Mus musculus), opossum (Didelphis virginiana), wallaroo (Macropus robustus), and platypus (Ornithorhynchus anatinus). Note that the rodent species are restricted to murids, and that the marsupials and the monotreme are used as the out-group. The phylogenetic tree of the 20 species is given in Figure 4. We also construct a phylogenetic tree with the ClustalW method.37 The result is shown in Figure 5.

Figure 4

The phylogenetic tree of 20 mammalian species based on our method. Phylogeny was based on analysis of the combined sequences of 13 proteins encoded by the same strand of the mitochondrial genome. Sequences include those for human, common chimpanzee, pygmy chimpanzee, gorilla, orangutan, gibbon, baboon, horse, white rhinoceros, harbor seal, gray seal, cat, fin whale, blue whale, cow, rat, mouse, opossum, wallaroo, and platypus. The sequences of opossum, wallaroo, and platypus were used as the out-group.

Figure 5

The phylogenetic tree of 20 mammalian species based on ClustalW. Phylogeny was based on analysis of the combined sequences of 13 proteins encoded by the same strand of the mitochondrial genome. Sequences include those for human, common chimpanzee, pygmy chimpanzee, gorilla, orangutan, gibbon, baboon, horse, white rhinoceros, harbor seal, gray seal, cat, fin whale, blue whale, cow, rat, mouse, opossum, wallaroo, and platypus. The sequences of opossum, wallaroo, and platypus were used as the out-group.

Comparing Figures 4 and 5, we find that: (1) both trees distinguish the marsupials and monotremes, rodents, ferungulates, and primates; (2) it has been debated which two of the three main groups of placental mammals (primates, ferungulates, and rodents) are more closely related. Figure 4 supports the suggestion that primates and ferungulates are more closely related, whereas Figure 5 shows primates and rodents as more closely related; (3) in Figure 5, the out-group (opossum, wallaroo, and platypus) is clustered close to the rodents. The result in Figure 4 is consistent with known evolutionary relationships and with others’ partial results38,39 except for the opossum, so our method is more advantageous in this regard. To show the efficiency of the proposed approach on different protein families, we further make a comparison with a widely used method, EMBOSS Water pairwise sequence alignment. We test several families with the two methods, including the 13 protein families encoded by the same strand of the mitochondrial genome and the UDP glucuronosyltransferase family proteins (including the same genus but different species). The test results show that the similarity distances or scores from the two methods are in close agreement. Furthermore, for longer protein sequences the results of the two methods are more consistent.

Conclusions

Graphical techniques for biological sequences have been used as a powerful tool for the visualization and analysis of protein sequences. In this study, we introduce a new spectrum-like graphical representation of protein sequences based on a significant physicochemical property, the hydrophobicity of amino acids. The advantage of our approach is that it allows visual inspection of data, which helps recognize major similarities among different proteins, and even protein structures. For long protein sequences, the frequencies are easily computed and can be used to numerically characterize protein sequences, and the examination of similarity/dissimilarity illustrates the utility of the approach. The computational complexity of alignment methods and matrix-invariant techniques is at least O(N²). Our method does not require multiple sequence alignment and at the same time greatly reduces the computational complexity. Our approach also gives two novel numerical characterizations of proteins: one is based on the frequencies of the amplitudes of the spectral graphs and the other on the χ² value, and both are used to analyze the similarity of protein sequences. Both computational scientists and molecular biologists can use them to analyze protein sequences efficiently. Theoretical values of the amplitude frequencies are calculated, and the results of the compatibility test show that the distribution of hydrophilic and hydrophobic amino acids may have special biological significance. To a certain degree, our method can extract the information underlying the chronological dependencies of structural features and is successfully applied to sequences comprising similar structural features in chronologically different positions.
Other physicochemical properties of amino acids will also be useful for studying and solving bioinformatics problems.

28 in total

1. Ancient coevolution of baculoviruses and their insect hosts.

Authors: Elisabeth A Herniou; Julie A Olszewski; David R O'Reilly; Jenny S Cory
Journal: J Virol Date: 2004-04 Impact factor: 5.103

2. A new graphical representation of similarity/dissimilarity studies of protein sequences.

Authors: P He
Journal: SAR QSAR Environ Res Date: 2010-07 Impact factor: 3.000

Review 3. On graphical and numerical representation of protein sequences.

Authors: Fenglan Bai; Tianming Wang
Journal: J Biomol Struct Dyn Date: 2006-04

4. Analysis of similarity/dissimilarity of DNA sequences based on a class of 2D graphical representation.

Authors: Yu-Hua Yao; Qi Dai; Xu-Ying Nan; Ping-An He; Zuo-Ming Nie; Song-Ping Zhou; Yao-Zhou Zhang
Journal: J Comput Chem Date: 2008-07-30 Impact factor: 3.376

5. Analysis of similarity/dissimilarity of protein sequences.

Authors: Yu-Hua Yao; Qi Dai; Chun Li; Ping-An He; Xu-Ying Nan; Yao-Zhou Zhang
Journal: Proteins Date: 2008-12

6. Spectral representation of reduced protein models.

Authors: M Randić; M Vracko; M Novic; D Plavsić
Journal: SAR QSAR Environ Res Date: 2009-07 Impact factor: 3.000

7. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders.

Authors: Y Cao; A Janke; P J Waddell; M Westerman; O Takenaka; S Murata; N Okada; S Pääbo; M Hasegawa
Journal: J Mol Evol Date: 1998-09 Impact factor: 2.395

8. Simpler DNA sequence representations.

Authors: M A Gates
Journal: Nature Date: 1985 Jul 18-24 Impact factor: 49.962

9. An extensive analysis on the global codon usage pattern of baculoviruses.

Authors: Yue Jiang; Fei Deng; Hualin Wang; Zhihong Hu
Journal: Arch Virol Date: 2008-11-23 Impact factor: 2.574

10. Complete sequence and organization of Antheraea pernyi nucleopolyhedrovirus, a dr-rich baculovirus.

Authors: Zuo-Ming Nie; Zhi-Fang Zhang; Dan Wang; Ping-An He; Cai-Ying Jiang; Li Song; Fang Chen; Jie Xu; Ling Yang; Lin-Lin Yu; Jian Chen; Zheng-Bing Lv; Jing-Jing Lu; Xiang-Fu Wu; Yao-Zhou Zhang
Journal: BMC Genomics Date: 2007-07-24 Impact factor: 3.969
