Sequence comparison is one of the foundations of bioinformatics and can be used to study evolutionary relations among sequences. In this study, a 2D spectrum-like graphical representation of protein sequences is presented, based on the hydrophobicity scale of amino acids. The frequencies of the amplitudes of 4-subsequences are adopted to characterize a spectrum-like graph, and a 17D vector is used as the descriptor of a protein sequence. A χ² compatibility test is also performed. The new similarity analysis approach is illustrated on all protein sequences encoded by the mitochondrial genomes of 20 different species. Finally, a comparison with the ClustalW method shows the utility of our method.
Keywords:
compatibility test; protein sequences; similarities/dissimilarities; spectral representation
Comparison of bio-sequences, such as DNA, RNA, and protein sequences, is the origin of bioinformatics. Through comparison, we can identify the similarity/dissimilarity of sequences from different species. Many techniques have been introduced, such as graphical representations of DNA/RNA. Based on graphical representations, numerical characterization techniques offer a route toward quantitatively estimating the similarities/dissimilarities of sequences.1–13 Graphical representations of proteins emerged later because biological strings built on a 20-letter alphabet (representing the 20 natural amino acids) are more complex than strings built from only four letters (representing DNA or RNA). According to the genetic code, Randić et al. and Bai and Wang14–17 gave some graphical representations and sequence descriptors of proteins. Similar to existing graphical representations of DNA, and in order to better compare the similarities/dissimilarities of proteins, we modified some graphical representations of proteins.18–21 Graphical representations of protein sequences based on physicochemical properties of the 20 amino acids have also been introduced.22–26

Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Such a measure can lead to a better understanding of the nature of these sequences. The most important known challenges presented by these data, which are only partially addressed by existing methods, are the following: (1) it is difficult to extract the information underlying the chronological dependencies of structural features, which may have significant meaning; (2) the high computational cost involved is also an important problem; (3) ambiguities and complications arise in the similarity measurement task, especially for sequences of significantly different lengths.

In this study, we outline a novel spectrum-like graphical representation, which is based on the hydrophobicity of amino acids, and introduce a novel strategy for sequence comparison that calculates the frequencies of all amplitudes in the spectral graphs of different species. We compare all protein sequences in the mitochondria of 20 species.
Methods
Here we consider a physicochemical property that is closely related to the structure of proteins: the hydrophobicity of amino acids. The distribution of hydrophobic amino acids in the primary sequence can be used as an indicator to predict the secondary structure elements of a protein.27 In the following, we construct the spectrum-like graphical representation of protein sequences.

First, each amino acid is characterized by its own physicochemical properties. The twenty amino acids are simplified into two types28: hydrophobic amino acids H = {F, L, I, Y, M, W, V, A, P, C} and hydrophilic amino acids P = {S, N, K, D, R, T, H, Q, E, G}. The twenty amino acids are then further simplified into four types29: strong hydrophobic amino acids SH = {F, L, I, Y, W}; weak hydrophobic amino acids WH = {M, V, A, P, C}; strong hydrophilic amino acids SP = {S, N, K, D, R}; weak hydrophilic amino acids WP = {T, H, Q, E, G}.

Thus, given a protein sequence $S = s_1 s_2 \ldots s_N$ with $N$ amino acids, we inspect it by stepping one amino acid at a time. At step $i$ ($i = 1, 2, \ldots, N$), $s_i$ is transformed into $d_i$, which may be 2, 1, −1, or −2. The digit sequence $D = d_1 d_2 \ldots d_N$ is then obtained. To display the differences between hydrophobic and hydrophilic amino acids more clearly, we preset the values of the properties during the construction of the digit sequence:

$$d_i = \begin{cases} 2, & s_i \in SH \\ 1, & s_i \in WH \\ -1, & s_i \in WP \\ -2, & s_i \in SP \end{cases}$$

It is sometimes instructive to represent a random walk as a polygonal line, or path, in the plane, where the horizontal axis represents time and the vertical axis represents the value of $\{S_n\}$. Given a sequence $\{S_n\}$ of partial sums, we first plot the points $(n, S_n)$, and then for each $k$ we connect $(k, S_k)$ and $(k+1, S_{k+1})$ with a straight line segment. The length of a path is just the difference in the time values of the beginning and ending points on the path.

Accordingly, four consecutive numbers $d_i, d_{i+1}, d_{i+2}, d_{i+3}$ are summed as a partial sum; these sums are the values on the vertical axis and are considered the amplitudes. With $i$ as the value on the horizontal axis, running from 1 to $N-3$, we have the points $P_1(x_1, y_1), P_2(x_2, y_2), \ldots, P_{N-3}(x_{N-3}, y_{N-3})$, where $x_i$ and $y_i$ are calculated by the following formula:

$$x_i = i, \qquad y_i = d_i + d_{i+1} + d_{i+2} + d_{i+3}, \qquad i = 1, 2, \ldots, N-3.$$

Connecting adjacent points, we obtain a spectrum-like graph of the protein sequence.

We illustrate the approach on two short segments of a protein of the yeast Saccharomyces cerevisiae. Figure 1 shows the two spectral graphs, and the corresponding proteins are

Protein I: WTFESRNDPAKDPVILWLNGGPGCSSLTGL
Protein II: WFFESRNDPANDPIILWLNGGPGCSSFTGL

Figure 1
The spectrum-like graphs of two protein fragments I and II of yeast Saccharomyces cerevisiae, each having 30 amino acids.

The digit sequence ($d_i$) and 4-subsequence ($y_i$) of protein I are shown in Table 1.
Table 1
The digit sequence (d) and 4-subsequence (y) of the protein I.
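| i | Residue | d_i | y_i | i | Residue | d_i | y_i |
|---|---|---|---|---|---|---|---|
| 1 | W | 2 | 2 | 16 | L | 2 | 4 |
| 2 | T | −1 | −2 | 17 | W | 2 | 1 |
| 3 | F | 2 | −3 | 18 | L | 2 | −2 |
| 4 | E | −1 | −7 | 19 | N | −2 | −3 |
| 5 | S | −2 | −8 | 20 | G | −1 | −2 |
| 6 | R | −2 | −5 | 21 | G | −1 | 0 |
| 7 | N | −2 | −2 | 22 | P | 1 | −1 |
| 8 | D | −2 | −2 | 23 | G | −1 | −4 |
| 9 | P | 1 | −2 | 24 | C | 1 | −1 |
| 10 | A | 1 | −2 | 25 | S | −2 | −3 |
| 11 | K | −2 | −2 | 26 | S | −2 | −2 |
| 12 | D | −2 | 2 | 27 | F | 2 | 2 |
| 13 | P | 1 | 6 | 28 | T | −1 | − |
| 14 | V | 1 | 7 | 29 | G | −1 | − |
| 15 | I | 2 | 8 | 30 | L | 2 | − |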
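The construction of the digit sequence and the 4-subsequence amplitudes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the class-to-digit mapping (SH = 2, WH = 1, WP = −1, SP = −2) is read off Table 1. A running sum keeps the cost linear in the sequence length.

```python
# Four-class hydrophobicity encoding; digit values as they appear in Table 1.
CLASS_VALUES = {
    "FLIYW": 2,   # strong hydrophobic (SH)
    "MVAPC": 1,   # weak hydrophobic (WH)
    "THQEG": -1,  # weak hydrophilic (WP)
    "SNKDR": -2,  # strong hydrophilic (SP)
}
DIGIT = {aa: value for group, value in CLASS_VALUES.items() for aa in group}


def digit_sequence(protein):
    """Map each residue s_i to its digit d_i."""
    return [DIGIT[aa] for aa in protein]


def amplitudes(protein, window=4):
    """Return y_i = d_i + ... + d_{i+window-1} for i = 1..N-window+1.

    A running sum is updated in O(1) per step, so the whole pass is O(N).
    """
    d = digit_sequence(protein)
    if len(d) < window:
        return []
    total = sum(d[:window])
    ys = [total]
    for i in range(window, len(d)):
        total += d[i] - d[i - window]  # slide the window one residue
        ys.append(total)
    return ys


protein_i = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
y = amplitudes(protein_i)
print(len(y), y[:6])  # 27 [2, -2, -3, -7, -8, -5]
```

For the 30-residue Protein I this yields 27 amplitudes, matching the y_i column of Table 1; plotting the points (i, y_i) and joining neighbours gives the spectrum-like graph.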
Observing Figure 1, we see that the two curves are similar on the whole and share several identical local segments. In this method, we emphasize identical hydrophilic-hydrophobic patterns because such amino acids are more likely to form a similar or identical structure.

In Figure 2, we apply the new spectral representation to the ND6 (NADH dehydrogenase subunit 6) proteins of nine species: human, gorilla, common chimpanzee, pygmy chimpanzee, blue whale, fin whale, rat, mouse, and opossum. Comparing the curves in Figure 2 closely, we find that the curves of the ND6 proteins of human, gorilla, P. chimpanzee, and C. chimpanzee are more similar to one another. The ND6 protein graphs are also more similar within the pair F. whale and B. whale, and within the pair rat and mouse. In addition, the ND6 protein of opossum is obviously different from those of the other species. These similarities/dissimilarities are consistent with the known facts of evolution.
Figure 2
The spectrum-like graphs of the ND6 proteins of nine species: human, gorilla, common chimpanzee, pygmy chimpanzee, blue whale, fin whale, rat, mouse, and opossum.
Unexpectedly, we find that most amplitudes are greater than 0, which may mean that, under the four-class scheme, hydrophobic amino acids are preferred in these protein sequences. This is probably because hydrophobic amino acids have an important influence on protein structure.
Results/Discussion
Once a matrix is available to represent a sequence, numerous matrix invariants25,26,30–33 can be used as sequence descriptors. However, the computational complexity of these matrix-invariant techniques is at least $O(N^2)$, which is the main computational difficulty. In this section, we overcome this difficulty and introduce a novel way to numerically characterize a protein sequence. Its computational complexity is reduced to $O(N)$, so it is easy to implement. In addition, the cost of the new sequence descriptor is linear in the length of the sequence, so it is appropriate for sequences of significantly different lengths.

When we construct the spectrum-like graph, we sum four consecutive numbers of the digit sequence. The sums are taken as the amplitudes, which can be any integer from −8 to 8. To obtain a numerical representation of a protein sequence, we calculate the frequency of each amplitude. Therefore, a protein sequence can be characterized by a 17D vector.

The data set consists of 13 proteins (cytochrome oxidase subunits I, II, and III; cytochrome b apoenzyme; NADH dehydrogenase subunits 1–6 and 4L; ATP synthase subunits 6 and 8) encoded by the typical mitochondrial genome of mammalian species. Information on the 13 proteins is listed in Table 2. The 13 proteins are concatenated into one long amino acid sequence and analyzed as one protein sequence. Their amplitude frequencies are listed in Table 3. From the results in Table 3, we construct the 17-component vectors of the spectral graphs for the 9 species, and these vectors are first normalized. For a vector $X$, normalization means $Z = (X - \mathrm{Mean}(X))/\mathrm{Std}(X)$, where $\mathrm{Mean}(X)$ is the mean of $X$ and $\mathrm{Std}(X)$ is the standard deviation of $X$.
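The 17-component frequency descriptor described above can be sketched as follows. This is an illustrative sketch only: `frequency_vector` is a hypothetical helper name, and the input is the list of amplitudes of Protein I copied from Table 1.

```python
from collections import Counter

AMPLITUDE_RANGE = range(-8, 9)  # the 17 possible sums of four digits


def frequency_vector(amps):
    """Return the relative frequency of each amplitude -8..8 as a 17D list."""
    if not amps:
        raise ValueError("at least one amplitude is required")
    counts = Counter(amps)
    n = len(amps)
    return [counts.get(a, 0) / n for a in AMPLITUDE_RANGE]


# Amplitudes y_1..y_27 of Protein I, as listed in Table 1.
protein_i_amplitudes = [2, -2, -3, -7, -8, -5, -2, -2, -2, -2, -2, 2, 6, 7,
                        8, 4, 1, -2, -3, -2, 0, -1, -4, -1, -3, -2, 2]
vec = frequency_vector(protein_i_amplitudes)
print(len(vec))  # 17
```

Because the vector always has 17 components regardless of sequence length, descriptors of proteins with very different lengths can be compared directly.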
Table 4 gives the similarity/dissimilarity matrix for the protein sequences of the nine species, based on the Euclidean distances between the normalized 17-component vectors. For two arbitrary sequences $S_1$ and $S_2$ with normalized vectors $z^{(1)}$ and $z^{(2)}$, the Euclidean distance $D(S_1, S_2)$ between the two vectors is

$$D(S_1, S_2) = \sqrt{\sum_{k=1}^{17} \left(z^{(1)}_k - z^{(2)}_k\right)^2}.$$
Table 2
Information on all protein sequences in the mitochondria of 9 species. Each entry gives the accession number followed by the sequence length in parentheses.

| Protein | Human | Gorilla | P. chimp | C. chimp | F. whale | B. whale | Rat | Mouse | Opossum |
|---|---|---|---|---|---|---|---|---|---|
| ND1 | CAA24026 (318) | BAA85277 (318) | BAA85294 (318) | BAA85268 (318) | CAA43444 (318) | CAA50995 (318) | CAA32954 (318) | CAA24080 (315) | CAA82677 (318) |
| ND2 | CAA24027 (347) | BAA85278 (347) | BAA85295 (347) | BAA85269 (347) | CAA43445 (347) | CAA50996 (347) | CAA32955 (345) | CAA24081 (345) | CAA82678 (347) |
| COI | CAA24028 (513) | BAA85279 (513) | BAA85296 (513) | BAA85270 (513) | CAA43451 (516) | CAA50997 (516) | CAA32956 (514) | CAA24082 (514) | CAA82679 (513) |
| COII | CAA24029 (227) | BAA07303 (227) | BAA07312 (227) | BAA07299 (227) | CAA43452 (227) | CAA50998 (227) | CAA32957 (227) | CAA24083 (227) | CAA82680 (235) |
| ATP8 | CAA24030 (68) | BAA07304 (68) | BAA07313 (68) | BAA07300 (68) | CAA43441 (63) | CAA50999 (63) | CAA32958 (67) | CAA24084 (67) | CAA82681 (69) |
| ATP6 | CAA24031 (226) | BAA85280 (226) | BAA85297 (226) | BAA85271 (226) | CAA43442 (226) | CAA51000 (226) | CAA32959 (226) | CAA24085 (226) | CAA82682 (226) |
| COIII | CAA24032 (261) | BAA85281 (261) | BAA85298 (261) | BAA85272 (261) | CAA43453 (261) | CAA51001 (261) | CAA32960 (261) | CAA24090 (278) | CAA82683 (281) |
| ND3 | CAA24033 (115) | BAA85282 (115) | BAA85299 (115) | BAA85273 (115) | CAA43446 (115) | CAA51002 (115) | CAA32961 (115) | CAA24086 (114) | CAA82684 (116) |
| ND4L | CAA24034 (98) | BAA07305 (98) | BAA07314 (98) | BAA07301 (98) | CAA43447 (98) | CAA51003 (98) | CAA32962 (98) | CAA24087 (97) | CAA82685 (98) |
| ND4 | CAA24035 (459) | BAA85283 (459) | BAA85300 (459) | BAA85274 (459) | CAA43448 (459) | CAA51004 (459) | CAA32963 (459) | CAA24091 (474) | CAA82686 (474) |
| ND5 | CAA24036 (603) | BAA07306 (603) | BAA07315 (603) | BAA07302 (603) | CAA43449 (606) | CAA51005 (606) | CAA32964 (610) | CAA24088 (607) | CAA82687 (602) |
| ND6 | CAA24037 (174) | BAA07307 (174) | BAA85301 (174) | BAA85275 (174) | CAA43450 (175) | CAA51006 (175) | CAA32965 (172) | CAA24089 (172) | CAA82688 (168) |
| CYTB | CAA24038 (380) | BAA85284 (380) | BAA85302 (380) | BAA85276 (380) | CAA43443 (379) | CAA51007 (379) | CAA32966 (380) | CAA24092 (392) | CAA82689 (382) |
| Total length | 3789 | 3789 | 3789 | 3789 | 3790 | 3790 | 3792 | 3728 | 3729 |
Table 3
The frequencies of the amplitudes of the spectral graphs for all protein sequences in the mitochondria of 9 different species.

| f(y_i) | Human | Gorilla | P. chimp | C. chimp | F. whale | B. whale | Rat | Mouse | Opossum |
|---|---|---|---|---|---|---|---|---|---|
| f(−8) | 0.0018 | 0.0018 | 0.0021 | 0.0021 | 0.0008 | 0.0011 | 0.0018 | 0.0032 | 0.0013 |
| f(−7) | 0.0040 | 0.0058 | 0.0040 | 0.0037 | 0.0063 | 0.0063 | 0.0071 | 0.0048 | 0.0084 |
| f(−6) | 0.0114 | 0.0100 | 0.0129 | 0.0129 | 0.0111 | 0.0106 | 0.0108 | 0.0100 | 0.0118 |
| f(−5) | 0.0145 | 0.0143 | 0.0148 | 0.0148 | 0.0177 | 0.0158 | 0.0161 | 0.0177 | 0.0154 |
| f(−4) | 0.0341 | 0.0365 | 0.0359 | 0.0351 | 0.0372 | 0.0378 | 0.0430 | 0.0412 | 0.0363 |
| f(−3) | 0.0520 | 0.0541 | 0.0534 | 0.0541 | 0.0494 | 0.0470 | 0.0546 | 0.0508 | 0.0523 |
| f(−2) | 0.0634 | 0.0565 | 0.0576 | 0.0571 | 0.0562 | 0.0576 | 0.0504 | 0.0526 | 0.0476 |
| f(−1) | 0.0726 | 0.0695 | 0.0716 | 0.0716 | 0.0673 | 0.0689 | 0.0697 | 0.0717 | 0.0753 |
| f(0) | 0.1062 | 0.1096 | 0.1091 | 0.1094 | 0.1128 | 0.1122 | 0.1124 | 0.1124 | 0.1163 |
| f(1) | 0.1220 | 0.1249 | 0.1236 | 0.1223 | 0.1180 | 0.1215 | 0.1196 | 0.1145 | 0.1147 |
| f(2) | 0.0914 | 0.0890 | 0.0919 | 0.0909 | 0.0877 | 0.0866 | 0.0823 | 0.0814 | 0.0826 |
| f(3) | 0.1112 | 0.1059 | 0.1072 | 0.1091 | 0.1109 | 0.1106 | 0.1174 | 0.1174 | 0.1181 |
| f(4) | 0.1233 | 0.1244 | 0.1244 | 0.1263 | 0.1252 | 0.1231 | 0.1306 | 0.1293 | 0.1283 |
| f(5) | 0.0713 | 0.0737 | 0.0737 | 0.0737 | 0.0700 | 0.0721 | 0.0678 | 0.0701 | 0.0646 |
| f(6) | 0.0465 | 0.0465 | 0.0444 | 0.0433 | 0.0544 | 0.0560 | 0.0454 | 0.0500 | 0.0481 |
| f(7) | 0.0523 | 0.0534 | 0.0515 | 0.0515 | 0.0523 | 0.0523 | 0.0483 | 0.0497 | 0.0562 |
| f(8) | 0.0219 | 0.0240 | 0.0219 | 0.0222 | 0.0227 | 0.0206 | 0.0224 | 0.0233 | 0.0227 |
Table 4
The similarity matrix of 9 species based on the frequencies of amplitudes.
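| Species | Gorilla | P. chimpanzee | C. chimpanzee | F. whale | B. whale | Rat | Mouse | Opossum |
|---|---|---|---|---|---|---|---|---|
| Human | 4.2144 | 2.7639 | 3.0017 | 5.5206 | 5.1463 | 6.9385 | 7.1704 | 7.4932 |
| Gorilla | | 4.0165 | 4.0790 | 5.1489 | 5.7607 | 6.1994 | 6.9165 | 7.2921 |
| P. chimpanzee | | | 1.0975 | 5.6356 | 5.5562 | 6.4450 | 7.1025 | 7.5040 |
| C. chimpanzee | | | | 5.6890 | 5.9505 | 6.0357 | 6.7764 | 7.1315 |
| F. whale | | | | | 3.2861 | 5.5199 | 5.5947 | 6.0795 |
| B. whale | | | | | | 6.5392 | 6.6137 | 7.0378 |
| Rat | | | | | | | 4.0634 | 5.6101 |
| Mouse | | | | | | | | 6.1929 |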
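A distance matrix of this kind can be sketched with the Python standard library alone. The sketch below rests on two assumptions that the text does not spell out: each of the 17 components is z-scored across the set of species, and the sample standard deviation is used. That reading appears consistent with the scale of the entries in Table 4, but it remains an assumption. The function names and the small three-component example vectors are hypothetical.

```python
import math
from statistics import mean, stdev


def zscore_columns(vectors):
    """Z-score each component across sequences (sample standard deviation).

    Assumption: normalization is applied per component, not per vector.
    """
    scaled_columns = []
    for col in zip(*vectors):
        mu, sigma = mean(col), stdev(col)
        # A constant component carries no information; map it to zeros.
        scaled_columns.append(
            [(x - mu) / sigma if sigma > 0 else 0.0 for x in col]
        )
    return [list(row) for row in zip(*scaled_columns)]


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise Euclidean distances between the normalized vectors."""
    z = zscore_columns(vectors)
    return [[euclidean(a, b) for b in z] for a in z]


# Hypothetical toy descriptors (3 sequences, 3 components) for illustration.
toy = [
    [0.10, 0.50, 0.40],
    [0.20, 0.50, 0.30],
    [0.30, 0.20, 0.50],
]
D = distance_matrix(toy)
for row in D:
    print(["%.4f" % value for value in row])
```

With the real data, `vectors` would hold the nine 17-component columns of Table 3, one list per species.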
The analysis of similarities/dissimilarities rests on the following assumption: the smaller the distance between two proteins, the more similar the two proteins are. The indexes of similarity/dissimilarity between the nine species are listed in Table 4.

Observing Table 4, we find that the smaller entries are associated with pairs within the groups (human, gorilla, P. chimpanzee, C. chimpanzee), (F. whale, B. whale), and (rat, mouse). On the other hand, the larger entries in the similarity/dissimilarity matrix are those belonging to opossum. These results are consistent with the known conclusions of evolution.12,25

We calculate the theoretical values of the amplitude frequencies, which are listed in Table 5. As the theoretical values are symmetrical, we show only one half. We want to know whether the amplitude frequencies of the 13 proteins in the 9 species are consistent with the theoretical ratios. Figure 6 compares the amplitude distributions of the 13 human proteins with the theoretical values. Then, we calculate the χ² values:

$$\chi^2 = \sum_{k=-8}^{8} \frac{(O_k - E_k)^2}{E_k},$$

where $O_k$ is the observed count of amplitude $k$ and $E_k$ is the count expected from the theoretical frequency.
Table 5
The theoretical frequencies of the amplitudes.

| y_i | Split | Combinatorial number | Theoretical frequency |
|---|---|---|---|
| −8 | {−2, −2, −2, −2} | $C_4^4 = 1$ | 1/256 ≈ 0.00391 |
| −7 | {−2, −2, −2, −1} | $C_4^3 = 4$ | 4/256 ≈ 0.01563 |
| −6 | {−2, −2, −1, −1} | $C_4^2 = 6$ | 6/256 ≈ 0.02344 |
| −5 | {−2, −2, −2, 1}; {−2, −1, −1, −1} | $C_4^3 = 4$; $C_4^1 = 4$ | 8/256 ≈ 0.03125 |
| −4 | {−2, −2, −2, 2}; {−2, −2, −1, 1}; {−1, −1, −1, −1} | $C_4^1 = 4$; $C_4^1 C_3^1 = 12$; $C_4^4 = 1$ | 17/256 ≈ 0.06641 |
| −3 | {−2, −2, −1, 2}; {−2, −1, −1, 1} | $C_4^1 C_3^1 = 12$; $C_4^1 C_3^1 = 12$ | 24/256 ≈ 0.09375 |
| −2 | {−2, −2, 1, 1}; {−2, −1, −1, 2}; {−1, −1, −1, 1} | $C_4^2 = 6$; $C_4^1 C_3^1 = 12$; $C_4^1 = 4$ | 22/256 ≈ 0.08594 |
| −1 | {−2, −2, 1, 2}; {−2, −1, 1, 1}; {−1, −1, −1, 2} | $C_4^1 C_3^1 = 12$; $C_4^1 C_3^1 = 12$; $C_4^1 = 4$ | 28/256 ≈ 0.10938 |
| 0 | {−2, −2, 2, 2}; {−2, −1, 1, 2}; {−1, −1, 1, 1} | $C_4^2 = 6$; $C_4^1 C_3^1 C_2^1 = 24$; $C_4^2 = 6$ | 36/256 ≈ 0.14063 |
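The theoretical frequencies in Table 5 can be checked by brute-force enumeration. The sketch below assumes, as the 1/256 denominators imply, that each of the four digit values is equally likely and that the four positions are independent; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

DIGITS = (-2, -1, 1, 2)  # SP, WP, WH, SH


def theoretical_frequencies(window=4):
    """Enumerate all len(DIGITS)**window digit tuples and tally their sums."""
    counts = Counter(sum(combo) for combo in product(DIGITS, repeat=window))
    total = len(DIGITS) ** window  # 4**4 = 256 ordered tuples
    return {
        amp: Fraction(counts.get(amp, 0), total)
        for amp in range(-2 * window, 2 * window + 1)
    }


theo = theoretical_frequencies()
print(theo[-4], float(theo[-4]))  # 17/256 0.06640625
```

Enumeration counts ordered tuples, so it agrees with the combinatorial numbers in the table, which count the arrangements of each multiset split.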
Figure 6
The distributions of the amplitudes of the 13 human proteins and the theoretical values. Proteins include cytochrome oxidase subunits I, II, and III (COI, COII, and COIII); cytochrome b apoenzyme (CYTB); NADH dehydrogenase subunits 1–6 and 4L (ND1, ND2, ND3, ND4, ND5, ND6, and ND4L); ATP synthase subunits 6 and 8 (ATP6 and ATP8).
The χ² values of the 13 proteins for the 9 species are listed in Table 6. Each protein corresponds to one 17-component vector, so the degrees of freedom are df = 17 − 1 = 16. At the significance level α = 0.01, the critical value is $\chi^2_{0.01}(16) = 32.000$. Nearly all χ² values in Table 6 exceed this critical value, so the observed frequencies are not consistent with the theoretical ratios. The amino acid sequence of a protein determines its structure and function, so its pattern is not expected to be random.
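A goodness-of-fit computation of this kind can be sketched as follows. The sketch assumes Pearson's χ² statistic computed on amplitude counts (rather than relative frequencies) against the equiprobable model behind Table 5; the critical value 32.000 for df = 16 and α = 0.01 is taken from a standard χ² table. The function names and the example counts are hypothetical, not data from this study.

```python
from collections import Counter
from itertools import product

AMPLITUDES = list(range(-8, 9))
CRITICAL_0_01_DF16 = 32.000  # standard chi-square table: df = 16, alpha = 0.01


def expected_probabilities():
    """Theoretical amplitude probabilities under equiprobable digits."""
    counts = Counter(sum(c) for c in product((-2, -1, 1, 2), repeat=4))
    return [counts[a] / 256 for a in AMPLITUDES]


def chi_square(observed_counts):
    """Pearson statistic: sum over amplitudes of (O - E)^2 / E."""
    n = sum(observed_counts)
    probs = expected_probabilities()
    return sum(
        (o - n * p) ** 2 / (n * p) for o, p in zip(observed_counts, probs)
    )


# Hypothetical counts that match the theoretical ratios exactly.
matching = [1, 4, 6, 8, 17, 24, 22, 28, 36, 28, 22, 24, 17, 8, 6, 4, 1]
# Hypothetical counts concentrated on amplitude 0.
skewed = [0] * 8 + [256] + [0] * 8

for name, obs in (("matching", matching), ("skewed", skewed)):
    stat = chi_square(obs)
    verdict = "reject" if stat > CRITICAL_0_01_DF16 else "do not reject"
    print(name, round(stat, 2), verdict)
```

A statistic above the critical value means that the hypothesis of agreement with the theoretical ratios is rejected at the 1% level.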
Table 6
The χ² values for the 13 proteins of 9 species.

| Species | ND1 | ND2 | COI | COII | ATP8 | ATP6 | COIII | ND3 | ND4L | ND4 | ND5 | ND6 | CYTB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 191.16 | 124.23 | 209.09 | 103.63 | 20.98 | 179.72 | 85.36 | 247.70 | 73.39 | 214.32 | 158.50 | 80.81 | 229.47 |
| Gorilla | 208.26 | 130.95 | 208.19 | 117.59 | 30.21 | 177.31 | 94.77 | 197.36 | 80.11 | 229.64 | 220.32 | 74.96 | 250.04 |
| P. chimpanzee | 177.87 | 107.64 | 205.79 | 107.14 | 17.95 | 187.76 | 88.52 | 177.68 | 65.03 | 222.80 | 189.38 | 70.46 | 278.48 |
| C. chimpanzee | 189.97 | 124.54 | 208.68 | 107.14 | 17.47 | 200.75 | 86.82 | 173.44 | 75.55 | 232.19 | 176.92 | 75.50 | 306.85 |
| F. whale | 170.14 | 139.65 | 195.96 | 45.17 | 45.68 | 128.02 | 80.64 | 154.67 | 75.31 | 333.77 | 187.34 | 114.51 | 331.96 |
| B. whale | 171.69 | 122.01 | 204.06 | 47.08 | 44.93 | 117.65 | 82.78 | 154.67 | 87.32 | 322.66 | 191.10 | 93.92 | 239.08 |
| Rat | 190.92 | 114.59 | 175.99 | 31.61 | 41.38 | 126.04 | 101.83 | 120.74 | 26.18 | 268.47 | 170.42 | 153.39 | 312.43 |
| Mouse | 183.63 | 207.92 | 178.29 | 30.79 | 53.50 | 126.70 | 80.15 | 142.66 | 27.57 | 234.89 | 199.80 | 117.80 | 301.85 |
| Opossum | 232.56 | 107.19 | 220.40 | 64.89 | 31.87 | 135.80 | 116.69 | 189.43 | 76.05 | 201.68 | 207.93 | 169.48 | 212.37 |
First, we compare the helicase protein sequences of 12 baculoviruses: 3 group I alphabaculoviruses (AcMNPV, BmNPV, RoMNPV); 6 group II alphabaculoviruses (HearNPV, HzSNPV, MacoNPVA, MacoNPVB, HaSNPV, AgseNPV); and 3 betabaculoviruses (AdorGV, CpGV, CrleGV). The length and group information of these protein sequences is shown in Table 7. The phylogenetic tree of the 12 helicase protein sequences is given in Figure 3. Their similarities/dissimilarities are consistent with the classification of these baculovirus proteins.34–36
Table 7
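Length and group information of the helicase protein sequences of 12 baculoviruses.

| Genus (group) | Virus name | Abbreviation | Accession no. | Length |
|---|---|---|---|---|
| Alphabaculovirus (Group I NPVs) | Autographa californica MNPV | AcMNPV | AAA66725 | 1221 |
| Alphabaculovirus (Group I NPVs) | Bombyx mori NPV | BmNPV | AAC63764 | 1222 |
| Alphabaculovirus (Group I NPVs) | Rachiplusia ou MNPV | RoMNPV | AAN28013 | 1221 |
| Alphabaculovirus (Group II NPVs) | Helicoverpa armigera NPV | HearNPV | AEN04007 | 1253 |
| Alphabaculovirus (Group II NPVs) | Helicoverpa zea SNPV | HzSNPV | AAL56093 | 1253 |
| Alphabaculovirus (Group II NPVs) | Mamestra configurata NPVA | MacoNPVA | AAM09201 | 1212 |
| Alphabaculovirus (Group II NPVs) | Mamestra configurata NPVB | MacoNPVB | AAM95079 | 1209 |
| Alphabaculovirus (Group II NPVs) | Helicoverpa armigera SNPV | HaSNPV | AAG53827 | 1253 |
| Alphabaculovirus (Group II NPVs) | Agrotis segetum NPV | AgseNPV | AAZ38246 | 1213 |
| Betabaculovirus (GVs) | Adoxophyes orana GV | AdorGV | AAP85713 | 1138 |
| Betabaculovirus (GVs) | Cydia pomonella GV | CpGV | AAK70750 | 1131 |
| Betabaculovirus (GVs) | Cryptophlebia leucotreta GV | CrleGV | AAQ21676 | 1128 |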
Figure 3
The phylogenetic tree based on protein sequences of 12 baculoviruses. Sequences include those for AcMNPV, BmNPV, RoMNPV, HearNPV, HzSNPV, MacoNPVA, MacoNPVB, HaSNPV, AgseNPV, AdorGV, CpGV, and CrleGV.
To further verify the validity of our approach, we performed an experiment on a dataset of the 13 proteins encoded by the same strand of the mitochondrial genome from 20 mammalian species: human (Homo sapiens), C. chimpanzee (Pan troglodytes), P. chimpanzee (Pan paniscus), gorilla (Gorilla gorilla), orangutan (Pongo pygmaeus), gibbon (Hylobates lar), baboon (Papio hamadryas), horse (Equus caballus), white rhinoceros (Ceratotherium simum), harbor seal (Phoca vitulina), gray seal (Halichoerus grypus), cat (Felis catus), F. whale (Balaenoptera physalus), B. whale (Balaenoptera musculus), cow (Bos taurus), rat (Rattus norvegicus), mouse (Mus musculus), opossum (Didelphis virginiana), wallaroo (Macropus robustus), and platypus (Ornithorhynchus anatinus). Note that the rodent species are restricted to murids, and that the marsupials and the monotreme are used as the out-group. The phylogenetic tree of the 20 species is given in Figure 4. We also construct a phylogenetic tree by the ClustalW method.37 The result is shown in Figure 5.
Figure 4
The phylogenetic tree of 20 mammalian species based on our method. The phylogeny is based on analysis of the combined sequences of 13 proteins encoded by the same strand of the mitochondrial genome. Sequences include those for human, common chimpanzee, pygmy chimpanzee, gorilla, orangutan, gibbon, baboon, horse, white rhinoceros, harbor seal, gray seal, cat, fin whale, blue whale, cow, rat, mouse, opossum, wallaroo, and platypus. The sequences of opossum, wallaroo, and platypus were used as the out-group.
Figure 5
The phylogenetic tree of 20 mammalian species based on ClustalW. The phylogeny is based on analysis of the combined sequences of 13 proteins encoded by the same strand of the mitochondrial genome. Sequences include those for human, common chimpanzee, pygmy chimpanzee, gorilla, orangutan, gibbon, baboon, horse, white rhinoceros, harbor seal, gray seal, cat, fin whale, blue whale, cow, rat, mouse, opossum, wallaroo, and platypus. The sequences of opossum, wallaroo, and platypus were used as the out-group.
Comparing Figures 4 and 5, we find that: (1) both trees distinguish the marsupials and monotremes, rodents, ferungulates, and primates; (2) it has been debated which two of the three main groups of placental mammals (primates, ferungulates, and rodents) are more closely related. Figure 4 supports the suggestion that primates and ferungulates are more closely related, whereas Figure 5 shows that primates and rodents are more closely related; (3) in Figure 5, the out-group of opossum, wallaroo, and platypus is clustered close to the rodents. The result in Figure 4 is consistent with the known conclusions of evolution and with others’ partial results38,39 except for the opossum, so our method is more advantageous in this regard.

To show the efficiency of the proposed approach on different protein families, we further compare it with a widely used method, EMBOSS Water pairwise sequence alignment. We test several families with the two methods, including the 13 protein families encoded by the same strand of the mitochondrial genome and the UDP glucuronosyltransferase family proteins (including the same genus but different species). The test results show that the similarity distances or scores obtained by the two methods are in close agreement with each other. Furthermore, for longer protein sequences, the results of the two methods are more consistent.
Conclusions
Graphical techniques for biological sequences have been used as a very powerful tool for the visualization and analysis of protein sequences. In this study, a new spectral representation of proteins is introduced, based on the hydrophobicity of amino acids.

We present a spectrum-like graphical representation of protein sequences, which is based on a significant physicochemical property. The advantage of our approach is that it allows visual inspection of data, which helps to recognize major similarities among different proteins, and even among protein structures.

For long protein sequences, the frequencies are easily computed and can be used to numerically characterize protein sequences, and the examination of similarity/dissimilarity illustrates the utility of the approach. The computational complexity of alignment methods and matrix-invariant techniques is at least $O(N^2)$. Our method does not require multiple sequence alignment and at the same time greatly reduces the computational complexity.

Our approach also gives novel numerical characterizations of proteins. One is based on the frequencies of the amplitudes of the spectral graphs and the other is based on the χ² values; both are used to analyze the similarity of protein sequences. Both computational scientists and molecular biologists can use them to analyze protein sequences efficiently.

Theoretical values of the amplitude frequencies are calculated. The results of the compatibility test show that the distribution of hydrophilic-hydrophobic amino acids may have special biological significance. To a certain degree, our method can extract the information underlying the chronological dependencies of structural features and is successfully applied to sequences comprising similar structural features in chronologically different positions.

Other physicochemical properties of amino acids will also be useful for studying and solving some bioinformatics problems.