
Measuring Similarity among Protein Sequences Using a New Descriptor.

Mervat M Abo-Elkhier, Marwa A Abd Elwahaab, Moheb I Abo El Maaty.

Abstract

The comparison of protein sequences according to similarity is a fundamental aspect of today's biomedical research. With advances in sequencing technologies, the number of protein sequences in public databases is increasing exponentially. The best-known sequence comparison methods are alignment based. They generally give excellent results when the sequences under study are closely related, but they are time consuming. Herein, a new alignment-free method is introduced. Our technique depends on a new graphical representation and descriptor. The graphical representation is a simple way to visualize protein sequences. The descriptor compresses the primary sequence into a single vector composed of only two values. Our approach gives good results with both short and long sequences in little computation time. It is applied to nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 spike protein sequences. Correlation and significance analyses are also introduced to compare our similarity/dissimilarity results with those of other approaches and with sequence homology.
Copyright © 2019 Mervat M. Abo-Elkhier et al.

Year:  2019        PMID: 31886192      PMCID: PMC6893242          DOI: 10.1155/2019/2796971

Source DB:  PubMed          Journal:  Biomed Res Int            Impact factor:   3.411


1. Introduction

Information encoded in the genome of any organism plays a central role in defining the life of that organism. The nucleotide sequence that forms any gene is translated into its corresponding amino acid sequence. This sequence of amino acids becomes functional only when it adopts its tertiary structure. Experimental methods such as X-ray diffraction and nuclear magnetic resonance are considered authoritative ways of determining protein structure and function, but they are very expensive and time consuming. Therefore, computational methods for predicting protein structure have become very useful. Proteins with similar sequences are usually homologous, typically displaying similar 3D structure and function. Sequence alignment is the first step of 3D structure prediction for protein sequences. Sequence comparison approaches are classified into alignment-based and alignment-free methods. BLAST (basic local alignment search tool) and ClustalW are the most widely used computer programs for alignment-based approaches [1-3]. These programs provide an approximate solution to the protein alignment problem. On the other hand, many alignment-free approaches have been proposed for sequence comparison. Most biological sequence analysis methods still have weaknesses, including low precision and long computation times [4, 5]. Similarity/dissimilarity analysis of biological sequences is used to extract the information stored in a protein sequence, and many mathematical schemes have been proposed to this end. Graphical representations of biological sequences identify the information content of a sequence to help biologists choose a more complex theoretical or experimental method. Graphical representation provides not only visual qualitative inspection of gene data but also mathematical characterizations through objects such as matrices.
Some 2D and 3D graphical representations are created by selecting a geometrical object that is used to describe nucleic acid bases or residues [6-10]. Others are based on assigning vectors of two or three components to nucleic acid bases or amino acids [11-17]. Adjacency matrices are also introduced in some articles [18-21], where an exact solution to the protein alignment problem is obtained. Additional methods use the discrete Fourier transform (DFT), in which DNA sequences are mapped into four binary indicator sequences and the DFT is then applied to these indicator sequences to transform them into the frequency domain [22, 23]. Dynamic representation is used to remove degeneracies in the previously mentioned approaches [24-31]. Another method is based on the simplified pulse-coupled neural network (S-PCNN) and Huffman coding, where the triplet code is used as a code bit to transform a DNA sequence into a numerical sequence [32]. In this study, we introduce a new alignment-free method for protein sequences. Each amino acid in the protein sequence is represented by a number, and a new 2D graphical representation is suggested. A new descriptor is introduced, comprising a vector ($\bar{A}$, $S_A$) composed of the mean and standard deviation of the combined intensity level values of each protein sequence. Our graphical representation eliminates degeneracy and has no loss of information. It is suitable for both short and long sequences. As a proof of concept, our approach is applied to nine beta globin protein sequences and nine ND5 (NADH dehydrogenase subunit 5) protein sequences. It can be applied to sequences of any length with the same efficiency. Correlation and significance analyses are introduced to compare our results with PID% [15] and ClustalW [33] and thereby demonstrate the utility of our approach.

2. Dataset, Technology, and Tools

All the protein sequences used in this study were downloaded from the National Center for Biotechnology Information (NCBI), https://www.ncbi.nlm.nih.gov, as FASTA files. These FASTA files are imported into Wolfram Mathematica 8, where all the results and figures are produced. The datasets comprise nine beta globin, nine ND5 (NADH dehydrogenase subunit 5), and 24 coronavirus protein sequences, as listed in Tables 1–3, respectively. These datasets are selected to differ in length.
Table 1

Nine beta globin protein sequences.

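| No. | Species | ID | Length |
|---|---|---|---|
| 1 | Gorilla | CAA43421 | 121 |
| 2 | Chimp | CAA26204 | 125 |
| 3 | Human | AAA16334 | 147 |
| 4 | Rat | CAA29887 | 147 |
| 5 | Mouse | CAA24101 | 147 |
| 6 | Gutta | ACH46399 | 147 |
| 7 | Duck | CAA33756 | 147 |
| 8 | Gallus | CAA23700 | 147 |
| 9 | Opossum | AAA30976 | 147 |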
Table 2

Nine ND5 protein sequences.

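| No. | Species | ID | Length |
|---|---|---|---|
| 1 | Human | AP_000649 | 603 |
| 2 | Gorilla | NP_008222 | 603 |
| 3 | Common chimpanzee | NP_008196 | 603 |
| 4 | Pygmy chimpanzee | NP_008209 | 603 |
| 5 | Fin whale | NP_006899 | 606 |
| 6 | Blue whale | NP_007066 | 606 |
| 7 | Rat | AP_004902 | 610 |
| 8 | Mouse | NP_904338 | 607 |
| 9 | Opossum | NP_007105 | 602 |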
Table 3

The 24 coronaviruses protein sequences.

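| No. | Access no. | Abbreviation | Length | Class |
|---|---|---|---|---|
| 1 | CAB91145 | TGEVG | 1447 | I |
| 2 | NP058424 | TGEV | 1447 | I |
| 3 | AAK38656 | PEDVC | 1383 | I |
| 4 | NP598310 | PEDV | 1383 | I |
| 5 | NP937950 | HCoVOC43 | 1361 | II |
| 6 | AAK83356 | BCoVE | 1363 | II |
| 7 | AAL57308 | BCoVL | 1363 | II |
| 8 | AAA66399 | BCoVM | 1363 | II |
| 9 | AAL40400 | BCoVQ | 1363 | II |
| 10 | AAS00080 | IBVC | 1169 | III |
| 11 | NP 040831 | IBV | 1162 | III |
| 12 | AAS10463 | GD03T0013 | 1255 | SARS-CoV |
| 13 | AAU93318 | PC4127 | 1255 | SARS-CoV |
| 14 | AAV49720 | PC4137 | 1255 | SARS-CoV |
| 15 | AAU93319 | PC4205 | 1255 | SARS-CoV |
| 16 | AAU04646 | civet007 | 1255 | SARS-CoV |
| 17 | AAU04649 | civet010 | 1255 | SARS-CoV |
| 18 | AAV91631 | A022 | 1255 | SARS-CoV |
| 19 | AAP51227 | GD01 | 1255 | SARS-CoV |
| 20 | AAS00003 | GZ02 | 1255 | SARS-CoV |
| 21 | AAP30030 | BJ01 | 1255 | SARS-CoV |
| 22 | AAP50485 | FRA | 1255 | SARS-CoV |
| 23 | AAP41037 | TOR2 | 1255 | SARS-CoV |
| 24 | AAQ01597 | TaiwanTC1 | 1255 | SARS-CoV |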

3. 2D Graphical Representation

A new 2D graphical representation is introduced. Each amino acid x in a protein sequence is represented by the suggested intensity $Y_x(i)$ and intensity level $A_x(i)$. The intensity of each amino acid depends on its abundance and on its location in the sequence. It is calculated using

$$Y_x(i) = \begin{cases} f_x \, i, & \text{if the residue at position } i \text{ is } x, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

where $f_x$ is the frequency of amino acid x in the sequence (the number of occurrences of x divided by N), N is the protein sequence length (the number of residues), and i is the position of the residue in the sequence. Then, the intensity level $A_x(i)$ of each amino acid x is calculated by using the natural logarithm function:

$$A_x(i) = \begin{cases} 10 \ln\!\left( \dfrac{N}{Y_x(i)} \right), & Y_x(i) \neq 0, \\ 0, & Y_x(i) = 0. \end{cases} \tag{2}$$

Therefore, each amino acid has its own intensity level, which is a vector of N elements according to equation (2). Finally, the combined intensity level A(i) of the protein sequence is obtained by summing the 20 intensity level vectors $A_x(i)$ by using equation (3). The combined intensity level A(i) is also a vector of N elements:

$$A(i) = \sum_{x=1}^{20} A_x(i), \qquad i = 1, \dots, N. \tag{3}$$

Each amino acid has its own graph, so twenty graphs are obtained for each sequence, one per amino acid. The combined graph is obtained by combining these 20 graphs within a single graph. This combined intensity level is our new 2D graphical representation.

Our approach is first applied to two short protein segments from the yeast Saccharomyces cerevisiae:

Protein I: “WTFESRNDPAKDPVILWLNGGPGCSSLTGL”
Protein II: “WFFESRNDPANDPIILWLNGGPGCSSFTGL”

These two short proteins consist of 30 amino acids each and differ at positions 2, 11, 14, and 27. The values $Y_x(i)$ and $A_x(i)$ for each amino acid in the two sequences are calculated. For protein I, the amino acid G occurs four times, at positions 20, 21, 23, and 29, so its frequency is $f_G = 4/30$. Substituting in equations (1) and (2) gives, for example, $Y_G(20) = (4/30) \times 20 \approx 2.67$ and $A_G(20) = 10 \ln(30/2.67) \approx 24.2$; the full results for $Y_G(i)$ and $A_G(i)$ are presented in Table 4.
Table 4

The intensity and intensity level vectors of the two short segments of protein from “yeast Saccharomyces cerevisiae” protein sequences.

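| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Y_G(i) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A_G(i) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

| i | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Y_G(i) | 0 | 0 | 0 | 0 | 2.67 | 2.80 | 0 | 3.07 | 0 | 0 | 0 | 0 | 0 | 3.87 | 0 |
| A_G(i) | 0 | 0 | 0 | 0 | 24.2 | 23.7 | 0 | 22.8 | 0 | 0 | 0 | 0 | 0 | 20.5 | 0 |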
By summing the values of A(i) for all amino acids in protein I, the total value of A(i) is obtained, as shown in Figure 1(a). The position i of each amino acid is located on the x-axis, and the total intensity level A(i) is located on the y-axis. Figures 1(a) and 1(b) show the intensity level of protein I and protein II, respectively. Of note, the two graphs have different A(i) values at positions 2, 11, 14, and 27.
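The calculation for the two yeast segments can be sketched in Python as follows. This is a minimal illustration rather than the authors' Mathematica code: the function names are hypothetical, and the formulas Y_x(i) = f_x · i and A_x(i) = 10 ln(N / Y_x(i)) are assumed forms, chosen because they reproduce the nonzero entries of Table 4.

```python
import math

def intensity(seq, aa):
    """Y_x(i): f_x * i where residue i is `aa`, else 0 (positions are 1-based)."""
    n = len(seq)
    f = seq.count(aa) / n  # frequency of `aa` in this sequence
    return [f * i if res == aa else 0.0 for i, res in enumerate(seq, start=1)]

def intensity_level(seq, aa):
    """A_x(i): 10 * ln(N / Y_x(i)) where Y_x(i) is nonzero, else 0 (assumed form)."""
    n = len(seq)
    return [10 * math.log(n / y) if y > 0 else 0.0 for y in intensity(seq, aa)]

def combined_intensity_level(seq):
    """A(i): position-wise sum of the per-amino-acid intensity level vectors."""
    levels = [intensity_level(seq, aa) for aa in sorted(set(seq))]
    return [sum(column) for column in zip(*levels)]

protein_1 = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
protein_2 = "WFFESRNDPANDPIILWLNGGPGCSSFTGL"

y_g = intensity(protein_1, "G")
a_g = intensity_level(protein_1, "G")
for pos in (20, 21, 23, 29):
    # Compare with the nonzero columns of Table 4.
    print(pos, round(y_g[pos - 1], 2), round(a_g[pos - 1], 1))

a1 = combined_intensity_level(protein_1)
a2 = combined_intensity_level(protein_2)
```

Amino acids that are absent from a sequence contribute all-zero vectors, so summing over the residues actually present gives the same result as summing over all 20 amino acids.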
Figure 1

2D graphical representation of the “combined intensity level” of two short protein segments from the yeast Saccharomyces cerevisiae. (a) Protein I. (b) Protein II.

We next apply our approach on nine beta globin and nine ND5 (NADH dehydrogenase subunit 5) protein sequences, which are illustrated in Tables 1 and 2. The 2D graphical representation for human, chimpanzee, and opossum beta globin protein sequences is illustrated in Figures 2(a)–2(c), respectively. The 2D graphical representations for fin whale and rat ND5 protein sequences are illustrated in Figures 3(a) and 3(b), respectively.
Figure 2

2D graphical representation of the “combined intensity level” of beta globin protein sequences. (a) Human, (b) chimpanzee, and (c) opossum.

Figure 3

2D graphical representation of “combined intensity level” of ND5 protein sequences. (a) Fin whale and (b) rat.

We finally apply our approach to the 24 coronavirus protein sequences listed in Table 3. The 2D graphical representations of the TGEVG (class I) and GD03T0013 (SARS-CoV) protein sequences are illustrated in Figures 4(a) and 4(b), respectively.
Figure 4

2D graphical representation of the “combined intensity level” of the TGEVG and GD03T0013 coronavirus protein sequences. (a) TGEVG and (b) GD03T0013.

4. Protein Sequence Descriptor

Mathematical descriptors help in recognizing major differences among similar protein sequences quantitatively. A new descriptor for protein sequences is suggested: a vector ($\bar{A}$, $S_A$) composed of the arithmetic mean $\bar{A} = \frac{1}{N}\sum_{i=1}^{N} A(i)$ and the standard deviation $S_A$ of the combined intensity level values A(i) of the protein sequence, both taken over the N positions of the sequence. This descriptor compresses the information from the primary protein sequence into a single vector composed of only two values. The beta globin, ND5, and coronavirus protein sequence descriptors are illustrated in Tables 5–7, respectively.
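A minimal Python sketch of the descriptor follows. Two points are assumptions rather than statements from the text: the combined intensity level is taken as A(i) = 10 ln(N / (f · i)), with f the frequency of the residue at position i, and the standard deviation is the sample standard deviation (divisor N − 1), which is what Mathematica's `StandardDeviation` computes. The function names are hypothetical.

```python
import math
import statistics
from collections import Counter

def combined_intensity_level(seq):
    """A(i) for every position, assuming A(i) = 10 * ln(N / (f * i))."""
    n = len(seq)
    counts = Counter(seq)
    return [
        10 * math.log(n / ((counts[res] / n) * i))
        for i, res in enumerate(seq, start=1)
    ]

def descriptor(seq):
    """Two-value descriptor: (mean, standard deviation) of A(i)."""
    levels = combined_intensity_level(seq)
    # statistics.stdev uses the N - 1 divisor (assumption, see lead-in).
    return statistics.mean(levels), statistics.stdev(levels)

protein_1 = "WTFESRNDPAKDPVILWLNGGPGCSSLTGL"
mean_a, sd_a = descriptor(protein_1)
print(mean_a, sd_a)
```

If the population standard deviation (divisor N) is intended instead, `statistics.pstdev` can be substituted; for sequences of several hundred residues the two differ only slightly.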
Table 5

Mean and standard deviation descriptor of beta globin protein sequences.

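| No. | Species | $\bar{A}_t$ | $S_{A_t}$ |
|---|---|---|---|
| 1 | Gorilla | 36.803 | 11.744 |
| 2 | Chimp | 36.912 | 11.586 |
| 3 | Human | 37.145 | 11.505 |
| 4 | Rat | 37.721 | 11.399 |
| 5 | Mouse | 37.695 | 11.727 |
| 6 | Gutta | 38.046 | 11.537 |
| 7 | Duck | 38.244 | 11.399 |
| 8 | Gallus | 38.349 | 11.169 |
| 9 | Opossum | 38.418 | 10.944 |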
Table 6

Mean and standard deviation descriptor of ND5 protein sequences.

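| No. | Species | $\bar{A}_t$ | $S_{A_t}$ |
|---|---|---|---|
| 1 | Human | 37.300 | 12.267 |
| 2 | Gorilla | 37.338 | 12.223 |
| 3 | Pygmy chimpanzee | 37.249 | 12.091 |
| 4 | Common chimpanzee | 37.251 | 12.277 |
| 5 | Fin whale | 37.540 | 11.961 |
| 6 | Blue whale | 37.534 | 12.027 |
| 7 | Rat | 37.385 | 11.621 |
| 8 | Mouse | 37.328 | 11.562 |
| 9 | Opossum | 37.558 | 11.419 |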
Table 7

Mean and standard deviation descriptor of the coronaviruses protein sequences.

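| No. | Abb. | Class | Mean | Standard deviation |
|---|---|---|---|---|
| 1 | TGEVG | I | 38.643 | 10.9412 |
| 2 | TGEV | I | 38.643 | 10.9412 |
| 3 | PEDVC | I | 38.452 | 11.1723 |
| 4 | PEDV | I | 38.452 | 11.1723 |
| 5 | HCoVOC43 | II | 38.703 | 10.7564 |
| 6 | BCoVE | II | 38.668 | 10.6803 |
| 7 | BCoVL | II | 38.678 | 10.6846 |
| 8 | BCoVM | II | 38.698 | 10.7755 |
| 9 | BCoVQ | II | 38.714 | 10.7656 |
| 10 | IBVC | III | 38.601 | 10.6271 |
| 11 | IBV | III | 38.654 | 10.6458 |
| 12 | GD03T0013 | SARS-CoV | 38.833 | 10.5783 |
| 13 | PC4127 | SARS-CoV | 38.838 | 10.5744 |
| 14 | PC4137 | SARS-CoV | 38.832 | 10.5785 |
| 15 | PC4205 | SARS-CoV | 38.838 | 10.5733 |
| 16 | civet007 | SARS-CoV | 38.831 | 10.587 |
| 17 | civet010 | SARS-CoV | 38.833 | 10.5829 |
| 18 | A022 | SARS-CoV | 38.829 | 10.5892 |
| 19 | GD01 | SARS-CoV | 38.821 | 10.5946 |
| 20 | GZ02 | SARS-CoV | 38.824 | 10.5867 |
| 21 | BJ01 | SARS-CoV | 38.816 | 10.5912 |
| 22 | FRA | SARS-CoV | 38.8189 | 10.5875 |
| 23 | TOR2 | SARS-CoV | 38.8186 | 10.5932 |
| 24 | TaiwanTC1 | SARS-CoV | 38.8176 | 10.5928 |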
Table 7 shows that the means of all 24 coronaviruses are around 38.7, ranging from 38.452 to 38.838, while their standard deviations vary according to class. The sequences are divided into four classes. The first four viruses belong to class I. The fifth to the ninth coronaviruses belong to class II. Class III contains the tenth and eleventh viruses. The remaining viruses, from the 12th to the 24th, belong to SARS-CoV. According to our approach, the standard deviation of class I ranges from 10.9412 to 11.1723, that of class II from 10.6803 to 10.7755, and that of class III from 10.6271 to 10.6458, while the standard deviation of SARS-CoV is approximately 10.58 (10.5733 to 10.5946). Because these four ranges do not overlap, the standard deviation values of the 24 coronaviruses classify them correctly into the four classes. The class ranges according to our approach are shown in Figure 5.
Figure 5

The four classes of the 24 coronaviruses protein sequences based on their standard deviation of the combined intensity level.
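The separation of the four classes can be checked directly from the standard deviation column of Table 7. The sketch below is illustrative only; the variable and function names are hypothetical, and the numbers are copied from Table 7.

```python
# Standard deviations of the combined intensity level, grouped by class (Table 7).
sd_by_class = {
    "I": [10.9412, 10.9412, 11.1723, 11.1723],
    "II": [10.7564, 10.6803, 10.6846, 10.7755, 10.7656],
    "III": [10.6271, 10.6458],
    "SARS-CoV": [
        10.5783, 10.5744, 10.5785, 10.5733, 10.587, 10.5829, 10.5892,
        10.5946, 10.5867, 10.5912, 10.5875, 10.5932, 10.5928,
    ],
}

# (min, max) range of the standard deviation within each class.
ranges = {cls: (min(values), max(values)) for cls, values in sd_by_class.items()}

def ranges_are_disjoint(class_ranges):
    """True if no two (low, high) intervals overlap."""
    ordered = sorted(class_ranges.values())
    return all(
        prev_high < next_low
        for (_, prev_high), (next_low, _) in zip(ordered, ordered[1:])
    )

for cls, (low, high) in ranges.items():
    print(cls, low, high)
print(ranges_are_disjoint(ranges))
```

Disjoint ranges mean that a simple threshold on the standard deviation is enough to assign each of these 24 sequences to its class.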

5. Similarity/Dissimilarity Analysis

To compare the species' protein sequences, the Euclidean distance between their descriptors is evaluated. For example, the descriptor of the human beta globin protein sequence is (37.145, 11.505) and that of the chimpanzee beta globin protein sequence is (36.912, 11.586). The degree of similarity between human and chimpanzee is measured by the Euclidean distance between these two vectors, $\sqrt{(37.145 - 36.912)^2 + (11.505 - 11.586)^2} \approx 0.25$, consistent with the entry 0.246 in Table 8; a smaller distance indicates more similar sequences. The similarity/dissimilarity matrices of the beta globin and ND5 protein sequences are illustrated in Tables 8 and 9, respectively. Table 8 shows that the human and chimpanzee sequences are similar. There is also a striking similarity between the mouse and rat sequences, while the human and opossum sequences are clearly dissimilar. Table 9 shows that the pygmy chimpanzee, common chimpanzee, human, and gorilla ND5 protein sequences are similar, while the blue whale is similar to the fin whale and the mouse is similar to the rat. As with beta globin, human and opossum are dissimilar. However, our algorithm does not measure the degree of similarity well for the pygmy chimpanzee: the distance between human and pygmy chimpanzee is 0.1826, while the distance between human and gorilla is 0.0575, as shown in Table 9. The results of both Tables 8 and 9 are approximately comparable to previous reports [13, 15, 21, 33–39].
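The distance computation can be sketched in Python using the rounded beta globin descriptors from Table 5. The names below are hypothetical, and because the inputs are rounded to three decimals, the results agree with Table 8 only to within about 0.001.

```python
import math

# (mean, standard deviation) descriptors of beta globin sequences, from Table 5.
descriptors = {
    "Human": (37.145, 11.505),
    "Chimp": (36.912, 11.586),
    "Gorilla": (36.803, 11.744),
    "Opossum": (38.418, 10.944),
}

def distance(species_a, species_b):
    """Euclidean distance between two species' descriptor vectors."""
    return math.dist(descriptors[species_a], descriptors[species_b])

for other in ("Chimp", "Gorilla", "Opossum"):
    # Compare with the Human row of Table 8 (0.246, 0.417, 1.391).
    print("Human", other, round(distance("Human", other), 3))
```

`math.dist` requires Python 3.8 or later; on older versions the same value is `math.hypot(a[0] - b[0], a[1] - b[1])`.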
Table 8

Similarity/dissimilarity analysis among nine beta globin protein sequences.

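| | Human | Gorilla | Chimp | Lemur | Mouse | Rat | Opossum | Duck | Gallus |
|---|---|---|---|---|---|---|---|---|---|
| Human | 0 | 0.417 | 0.246 | 0.461 | 0.593 | 0.586 | 1.391 | 1.104 | 1.25 |
| Gorilla | | 0 | 0.192 | 0.104 | 0.892 | 0.980 | 1.802 | 1.481 | 1.649 |
| Chimp | | | 0 | 0.270 | 0.795 | 0.829 | 1.637 | 1.344 | 1.496 |
| Lemur | | | | 0 | 0.870 | 0.993 | 1.823 | 1.479 | 1.660 |
| Mouse | | | | | 0 | 0.329 | 1.066 | 0.639 | 0.860 |
| Rat | | | | | | 0 | 0.833 | 0.523 | 0.669 |
| Opossum | | | | | | | 0 | 0.488 | 0.236 |
| Duck | | | | | | | | 0 | 0.253 |
| Gallus | | | | | | | | | 0 |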
Table 9

Similarity/dissimilarity analysis among nine ND5 protein sequences.

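| | Human | Gorilla | P. chimp | C. chimp | F. whale | B. whale | Rat | Mouse | Opossum |
|---|---|---|---|---|---|---|---|---|---|
| Human | 0 | 0.0575 | 0.1826 | 0.0503 | 0.3885 | 0.3349 | 0.6509 | 0.7054 | 0.8853 |
| Gorilla | | 0 | 0.1590 | 0.1021 | 0.3311 | 0.2775 | 0.6039 | 0.6617 | 0.8332 |
| P. chimp | | | 0 | 0.1855 | 0.3184 | 0.2918 | 0.4890 | 0.5351 | 0.7389 |
| C. chimp | | | | 0 | 0.4281 | 0.3776 | 0.6689 | 0.7189 | 0.9102 |
| F. whale | | | | | 0 | 0.0663 | 0.3737 | 0.4524 | 0.5417 |
| B. whale | | | | | | 0 | 0.4325 | 0.5092 | 0.6079 |
| Rat | | | | | | | 0 | 0.0826 | 0.2656 |
| Mouse | | | | | | | | 0 | 0.2705 |
| Opossum | | | | | | | | | 0 |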

6. The Phylogenetic Tree of the Protein Sequences Based on Our Method

We obtained the phylogenetic trees of the beta globin and ND5 protein sequences by applying UPGMA (Unweighted Pair Group Method with Arithmetic Mean) to the distance matrices of our method. The trees based on Tables 8 and 9 are presented in Figures 6 and 7, respectively. Figure 6 supports the utility of our similarity/dissimilarity analysis for beta globin protein sequences. Figure 7 shows our similarity/dissimilarity analysis of ND5. As noted above, our algorithm does not measure the degree of similarity between pygmy chimpanzee and human well, and this is visible in Figure 7: the P. chimp branch should be close to C. chimp. Despite this error, the tree shows that human, common chimpanzee, pygmy chimpanzee, and gorilla belong to the same cluster. To check the effect of this error on our algorithm, the results of our algorithm are compared to sequence homology. A correlation and significance analysis is also provided.
Figure 6

The phylogenetic tree of the nine beta globin protein sequences based on our method.

Figure 7

The phylogenetic tree of the nine ND5 protein sequences based on our method.
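UPGMA repeatedly merges the two clusters with the smallest average pairwise distance between their members. The following pure-Python sketch applies that rule to the ND5 distances of Table 9. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and cluster distances are recomputed from the leaf distances at every step for clarity rather than speed.

```python
from itertools import combinations

names = ["Human", "Gorilla", "P. chimp", "C. chimp", "F. whale",
         "B. whale", "Rat", "Mouse", "Opossum"]

# Upper triangle of Table 9, row by row (diagonal zeros omitted).
upper = [
    [0.0575, 0.1826, 0.0503, 0.3885, 0.3349, 0.6509, 0.7054, 0.8853],
    [0.1590, 0.1021, 0.3311, 0.2775, 0.6039, 0.6617, 0.8332],
    [0.1855, 0.3184, 0.2918, 0.4890, 0.5351, 0.7389],
    [0.4281, 0.3776, 0.6689, 0.7189, 0.9102],
    [0.0663, 0.3737, 0.4524, 0.5417],
    [0.4325, 0.5092, 0.6079],
    [0.0826, 0.2656],
    [0.2705],
]
leaf_dist = {
    frozenset((names[i], names[i + 1 + k])): value
    for i, row in enumerate(upper)
    for k, value in enumerate(row)
}

def cluster_distance(c1, c2):
    """Average of all leaf-to-leaf distances between two clusters."""
    total = sum(leaf_dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def upgma(leaves):
    """Return the merges as (cluster_1, cluster_2, distance), in order."""
    clusters = [(leaf,) for leaf in leaves]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma(names)
for c1, c2, d in merges:
    print(c1, "+", c2, round(d, 4))
```

With these distances the first merge joins human and common chimpanzee, and pygmy chimpanzee joins the primate cluster only after gorilla, which matches the behaviour described for Figure 7.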

7. Our Method Compared to PID% and ClustalW Results

The results of our algorithm are compared to sequence homology by two methods. First, we use the Smith-Waterman algorithm to calculate the number of identical residues in each pair of protein sequences [15], expressed as a percentage of identity (PID%). The PID% results for the nine beta globin sequences are illustrated as a similarity/dissimilarity matrix in Table 10. A larger PID% indicates more similar protein sequences. A correlation and significance analysis is provided to compare our approach in Table 8 with PID% in Table 10. The correlation between two sets of data is considered sufficiently strong when the absolute value of the correlation coefficient (r) is greater than 0.7. A negative sign of r indicates that when the first data set increases, the second data set decreases; this is expected here, because our measure is a distance whereas PID% is a similarity. We then assess statistical significance for correlation coefficients whose absolute value exceeds 0.7 to ensure that they are unlikely to occur by chance. Our sample set is composed of nine protein sequences. Therefore, we use 9 − 2 = 7 degrees of freedom. A t-value of 2.385 or greater indicates a probability of less than 0.05 that the result occurred by chance. The correlation coefficients and t-values for our approach are illustrated in Table 11.
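The correlation and significance calculation can be sketched as follows for the Human rows of Tables 8 and 10. The sketch assumes that r is the Pearson correlation coefficient over the nine entries of a species' row (including the self-comparison) and that the t-value is |r|·sqrt(n − 2)/sqrt(1 − r²) with n = 9; under these assumptions the result should land close to the Human row of Table 11. Function and variable names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_value(r, n):
    """Magnitude of the t statistic for r with n - 2 degrees of freedom."""
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Human row of Table 8 (our distances) and of Table 10 (PID%).
human_distance = [0, 0.417, 0.246, 0.461, 0.593, 0.586, 1.391, 1.104, 1.25]
human_pid = [100, 98.347, 93.6, 66.667, 60.544, 59.184, 44.898, 39.456, 38.776]

r = pearson_r(human_distance, human_pid)
t = t_value(r, len(human_distance))
print(round(r, 4), round(t, 4))
```

The same two functions can be applied to any other row pair, for example a row of Table 9 against the corresponding row of Table 12.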
Table 10

The similarity distance for nine different species of beta globin proteins calculated by the PID%.

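| | Human | Gorilla | Chimp | Lemur | Mouse | Rat | Opossum | Duck | Gallus |
|---|---|---|---|---|---|---|---|---|---|
| Human | 100 | 98.347 | 93.6 | 66.667 | 60.544 | 59.184 | 44.898 | 39.456 | 38.776 |
| Gorilla | | 100 | 95.041 | 66.942 | 58.678 | 55.372 | 46.281 | 39.669 | 38.843 |
| Chimp | | | 100 | 61.6 | 55.2 | 52.4 | 0. | 36.8 | 36. |
| Lemur | | | | 100 | 53.061 | 48.979 | 40.136 | 31.973 | 31.293 |
| Mouse | | | | | 100 | 78.231 | 39.456 | 35.374 | 35.374 |
| Rat | | | | | | 100 | 33.333 | 36.054 | 34.014 |
| Opossum | | | | | | | 100 | 40.136 | 39.456 |
| Duck | | | | | | | | 100 | 94.5578 |
| Gallus | | | | | | | | | 100 |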
Table 11

The correlation and significance analysis between our similarity analysis results of beta globin protein sequences in Table 8 and PID% similarity matrix in Table 10.

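| Species | Correlation coeff. (r) of our approach | t-value of our approach |
|---|---|---|
| Human | −0.8974 | 5.3806 |
| Gorilla | −0.8715 | 4.7015 |
| Chimp | −0.9105 | 5.8266 |
| Lemur | −0.9151 | 6.0024 |
| Mouse | −0.8489 | 4.2505 |
| Rat | −0.7248 | 2.7830 |
| Opossum | −0.5318 | |
| Duck | −0.7169 | 2.7209 |
| Gallus | −0.6960 | |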
Second, ClustalW is a widely used system for aligning any number of homologous nucleotide or protein sequences [33]. The ClustalW distance matrix of the nine ND5 protein sequences is illustrated in Table 12. Correlation and significance analyses are also provided to compare our approach in Table 9 with the ClustalW results in Table 12. The results of the correlation and significance analyses for our approach and for other approaches [15, 33] are illustrated in Table 13. Our ND5 sample set is also composed of nine protein sequences. Therefore, we again use 7 degrees of freedom and a t-value threshold of 2.385. Despite the unusual result for pygmy chimpanzee in Table 9, the correlation coefficient between the pygmy chimpanzee rows of our similarity matrix and the ClustalW matrix is 0.8811. This value is unlikely to occur by chance, as the t-value equals 4.9288, as illustrated in Table 13. The comparison of our results with PID%, ClustalW, and the results of other approaches indicates the utility of our approach.
Table 12

The similarity distance for nine different species of ND5 proteins calculated by the ClustalW.

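| | Human | Gorilla | P. chimp | C. chimp | F. whale | B. whale | Rat | Mouse | Opossum |
|---|---|---|---|---|---|---|---|---|---|
| Human | 0 | 10.7 | 7.1 | 6.9 | 41 | 41.3 | 50.2 | 48.9 | 50.4 |
| Gorilla | | 0 | 9.7 | 9.9 | 42.7 | 42.4 | 51.4 | 49.9 | 54 |
| P. chimp | | | 0 | 5.1 | 40.1 | 40.1 | 50.2 | 48.9 | 50.1 |
| C. chimp | | | | 0 | 40.4 | 40.4 | 50.8 | 49.6 | 51.4 |
| F. whale | | | | | 0 | 3.5 | 45.3 | 46.8 | 52.7 |
| B. whale | | | | | | 0 | 45 | 45.9 | 52.7 |
| Rat | | | | | | | 0 | 25.9 | 54 |
| Mouse | | | | | | | | 0 | 50.8 |
| Opossum | | | | | | | | | 0 |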
Table 13

The correlation and significance analysis comparing the ND5 similarity results of our approach (Table 9), of Table 7 in [33], and of Table 3 in [15] with the ClustalW distance matrix in Table 12.

| Species | r of our approach | t-value of our approach | r of [33] | t-value of [33] | r of [15] (Table 3) | t-value of [15] (Table 3) |
|---|---|---|---|---|---|---|
| Human | 0.9159 | 6.0389 | 0.7819 | 3.3181 | 0.9419 | 7.4169 |
| Gorilla | 0.9062 | 5.6692 | 0.7630 | 3.1229 | 0.9363 | 7.0524 |
| P. chimp | 0.8811 | 4.9288 | 0.7856 | 3.3588 | 0.8755 | 4.7944 |
| C. chimp | 0.9345 | 6.9482 | 0.7808 | 3.3069 | 0.9448 | 7.6311 |
| F. whale | 0.9674 | 10.109 | 0.8360 | 4.0314 | 0.8146 | 3.7160 |
| B. whale | 0.9239 | 6.3875 | 0.8430 | 4.1463 | 0.6593 | |
| Rat | 0.8048 | 3.5871 | 0.9213 | 6.2663 | 0.6479 | |
| Mouse | 0.8112 | 3.6699 | 0.6391 | | 0.6308 | |
| Opossum | 0.6378 | | 0.4299 | | 0.4772 | |

8. Conclusions

A new graphical representation of protein sequences is introduced. It is the combined intensity level of the 20 amino acids composing any protein sequence. Each amino acid in a given protein sequence has its own intensity and intensity level, which are vectors of N elements, where N is the protein sequence length. The combined intensity level is then computed and graphed to represent the protein sequence graphically. Our 2D graphical representation effectively displays differences between protein sequences without degeneracies: the graph does not overlap or intersect with itself. Our new descriptor is a vector of two elements, the mean and standard deviation of the combined intensity level ($\bar{A}$ and $S_A$). A similarity/dissimilarity analysis is carried out by computing the Euclidean distance between the descriptors of each pair of species. Examination of similarity/dissimilarity among nine beta globin, nine ND5, and 24 coronavirus protein sequences provided good results compared to previous approaches. The suggested approach is effective for both short and long sequences, and the computations are very simple. Furthermore, loss of sequence information is avoided. Correlation and significance analyses with PID% and ClustalW are also introduced to show the utility of our approach.