
A rapid method for characterization of protein relatedness using feature vectors.

Kareem Carr, Eleanor Murray, Ebenezer Armah, Rong L He, Stephen S-T Yau.

Abstract

We propose a feature vector approach to characterize the variation in large data sets of biological sequences. Each candidate sequence produces a single feature vector constructed with the number and location of amino acids or nucleic acids in the sequence. The feature vector characterizes the distance between the actual sequence and a model of a theoretical sequence based on the binomial and uniform distributions. This method is distinctive in that it does not rely on sequence alignment for determining protein relatedness, allowing the user to visualize the relationships within a set of proteins without making a priori assumptions about those proteins. We apply our method to two large families of proteins: protein kinase C, and globins, including hemoglobins and myoglobins. We interpret the high-dimensional feature vectors using principal components analysis and agglomerative hierarchical clustering. We find that the feature vector retains much of the information about the original sequence. By using principal component analysis to extract information from collections of feature vectors, we are able to quickly identify the nature of variation in a collection of proteins. Where collections are phylogenetically or functionally related, this is easily detected. Hierarchical agglomerative clustering provides a means of constructing cladograms from the feature vector output.

Year: 2010; PMID: 20221427; PMCID: PMC2832692; DOI: 10.1371/journal.pone.0009550

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240


Introduction

Recent advances in biotechnology have allowed sequencing of millions of proteins from a wide spectrum of organisms, and this information is rapidly becoming accessible to any researcher with an internet connection. For instance, the UniProtKB database currently contains over 7 million protein sequences and is updated every three weeks [1]. Current protein alignment methods are often slow and require assumptions about relatedness and evolutionary mechanisms [2]. In order to make use of the vast amount of protein data available, a method for quickly delineating large numbers of proteins into related types is necessary. As a solution, we propose a method for quantifying the frequency and position of amino acids within a protein, and demonstrate the ease, rapidity and usefulness of this technique for uncovering phylogenetic and functional relationships within protein families, using protein kinase C (PKC), hemoglobin and myoglobin as examples.

One of us (S.Y.) previously designed three types of parameters for use in clustering amino acid sequences [3]. Here, we reconceptualize these parameters, making them more statistical in character and more applicable to measuring protein similarity. The new parameters measure the degree to which the distribution of amino acids in a particular protein deviates from a theoretical protein containing an equal number of residues but undergoing neutral evolution. This measure allows us not only to characterize the extent to which the distribution of amino acids in a protein deviates from the expected but also the distance between proteins. Our theoretical model of the distribution of amino acids in a protein assumes that for any given amino acid, the locations of residues of that amino acid are distributed uniformly and the number of residues is distributed binomially.

In contrast to previous methods of constructing distance measures to determine protein relatedness [4], our method does not require performing multiple sequence alignment. Instead, this method creates measures of protein relatedness based on the distribution of amino acids in the proteins. Differences in these measures are then used to determine the difference between two proteins. In doing so, this method requires no assumptions about the way in which certain amino acids may be inserted or deleted. This allows us to look at protein difference in an abstract way, without making assumptions about the mechanisms by which these differences may arise.

Results

Implementation of the Feature Vectors

The selection pressure to which proteins are subjected affects both the number of residues in the protein [5] and the identity and location of these residues. By using a feature vector, which is an ordered list of numbers that characterize the distribution of amino acids in a protein, we can describe this selection pressure. Our method uses a feature vector constructed with three types of parameters (Fig. 1). The three types of parameters can be described as compositional, centroidal and distributional. Compositional parameters (type I) measure the extent to which the proportion of each amino acid deviates from the expected. Centroidal parameters (type II) measure the extent to which amino acids tend to be in a particular region of the protein. Distributional parameters (type III) measure the extent to which amino acids cluster along the length of the protein. All parameters are adjusted for the length of the protein. From simulation of parameter distributions, it can be seen that, for some range of protein lengths and frequencies of amino acids, individual parameters have an approximately Gaussian distribution over large data sets (data not shown). The distance between feature vectors is measured as Euclidean distance.
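The approximately Gaussian behavior mentioned above can be checked with a small simulation. The following Python sketch is illustrative and is not the authors' code: it draws residue counts from a binomial distribution, as the theoretical model assumes, and standardizes each count as a type I parameter. The function names, the protein length of 300, the 2000 trials, and the use of π(1 − π)/N as the theoretical variance of a proportion are assumptions made for illustration; π = 4/64 corresponds to an amino acid with four codons, such as glycine.

```python
import math
import random
import statistics


def type1_parameter(count, length, pi):
    """Standardized composition: (proportion - mean) / sqrt(variance)."""
    proportion = count / length
    # Assumption: binomial variance of a proportion, pi * (1 - pi) / N.
    variance = pi * (1.0 - pi) / length
    return (proportion - pi) / math.sqrt(variance)


def simulate_type1(length, pi, trials, seed=0):
    """Draw residue counts from Binomial(length, pi) and standardize each."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        # One Bernoulli trial per sequence position gives a binomial count.
        count = sum(1 for _ in range(length) if rng.random() < pi)
        values.append(type1_parameter(count, length, pi))
    return values


# Hypothetical settings: protein length 300, pi = 4/64, 2000 simulated proteins.
values = simulate_type1(length=300, pi=4 / 64, trials=2000)
print(round(statistics.mean(values), 2), round(statistics.stdev(values), 2))
```

If the binomial model holds, the printed mean should lie near 0 and the standard deviation near 1, which is what makes parameters for different amino acids comparable on a common scale.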
Figure 1

Calculation of the Protein Feature Vectors.

(A) Intermediate calculations necessary to compute three component parameters of the feature vector; parameters are computed as (measure − mean)/sqrt(variance). (B) Sample parameters of Types I–III for an amino acid of type X; the full 60-dimensional feature vector for a given protein will include parameters for all amino acid types in that protein (3 parameters for each of 20 amino acid types).

To describe any protein using the feature vector method, we must first compute parameters of type I, II and III for each of the twenty amino acids which occur in proteins (Supplement S1). The parameter for each amino acid can be computed based on the information in figure 1A, by subtracting the theoretical mean from the measure and dividing by the square root of the theoretical variance. As an example, the formula for a parameter of type I for glycine (G) is:

(n/N − π) / sqrt(π(1 − π)/N)

Here, π is the probability of glycine occurring in a theoretical dataset of proteins under neutral evolution. This probability is set by the user and can vary for a given dataset. For the current analyses, we use the number of genetic codons for glycine divided by 64. Alternatively, one can set π to the frequency of glycine in the total dataset of proteins. The variable n is the number of glycines in the protein and N is the length of the protein.

The following 9 steps detail the specifics of computing the parameters for the feature vector method and of using the feature vectors to categorize and analyze proteins. C++ and Mathematica (Wolfram Research, Urbana-Champaign, IL) code for steps 1–8 is available in the supplementary online materials (Sample Code S1 and S2). Step 9, performing a principal component analysis (PCA), can be done using Matlab software (Mathworks, Natick, MA) or similar statistical software. Matlab code is provided in the supplementary online materials (Sample Code S3).

1. Count the number of amino acids of each type and note the length of the protein. We denote the length of protein as N and the number of amino acids of each type as nA, nR, nN, nD, nC, nE, nQ, nG, nH, nI, nL, nK, nM, nF, nP, nS, nT, nW, nY and nV. Use these numbers to compute the proportion of amino acids in the protein, the type I measure (fig. 1A): pA, pR, pN, pD, pC, pE, pQ, pG, pH, pI, pL, pK, pM, pF, pP, pS, pT, pW, pY and pV. (Where A = Alanine, R = Arginine, N = Asparagine, D = Aspartic acid, C = Cysteine, E = Glutamic acid, Q = Glutamine, G = Glycine, H = Histidine, I = Isoleucine, L = Leucine, K = Lysine, M = Methionine, F = Phenylalanine, P = Proline, S = Serine, T = Threonine, W = Tryptophan, Y = Tyrosine, V = Valine.)
2. For each amino acid, find the indices of the positions in the protein sequence which contain that amino acid (we count the first position as one and not zero).
3. Use the indices from step 2 to compute the mean position of each amino acid, the type II measure: mA, mR, mN, mD, mC, mE, mQ, mG, mH, mI, mL, mK, mM, mF, mP, mS, mT, mW, mY and mV. The type II theoretical mean is calculated as the average of all possible positions in a protein of length N.
4. Using the indices from step 2, compute the unbiased variance, the type III measure, of the set of indices of each amino acid: vA, vR, vN, vD, vC, vE, vQ, vG, vH, vI, vL, vK, vM, vF, vP, vS, vT, vW, vY and vV. The type III theoretical mean is calculated as half of one less than the square of the length.
5. The type I theoretical mean is the a priori estimate of the probability of the occurrence of each amino acid: πA, πR, πN, πD, πC, πE, πQ, πG, πH, πI, πL, πK, πM, πF, πP, πS, πT, πW, πY and πV. This value was taken to be the number of codons corresponding to a particular amino acid divided by the total number of coding codons, although it could also more appropriately be taken as the rate of occurrence of amino acids in the population from which the sample proteins were selected. Use the type I mean to compute the type I theoretical variance of index positions given the observed number of amino acids.
6. Compute the type II theoretical variance in the mean and the type III theoretical variance of the variance given the observed number of amino acids as shown in figure 1A.
7. Normalize the measures computed in steps 1, 3 & 4 by subtracting the theoretical means, computed in steps 3, 4 & 5, and dividing by the square root of the theoretical variances, computed in steps 5 & 6.
8. Assemble the parameters into a 60-dimensional vector.
9. Using principal components analysis (PCA) or some other means of high dimensional data analysis, search for clusters or patterns in the protein data.

This method can be adapted for use in analyzing DNA or RNA sequences. To create a feature vector for a given DNA sequence, one need only create parameters of types I, II and III for each nucleic acid type, and then combine these parameters into a feature vector. The resulting DNA feature vector will be 12-dimensional, with 3 types of parameter for each of 4 types of nucleic acid, compared to the 60-dimensional protein feature vector. Tables S1, S2, and S3 contain feature vectors for our PKC, hemoglobin and myoglobin datasets, respectively.

To demonstrate the utility of this method, we applied our feature vector method to an exhaustive set of 128 proteins from the PKC family and a set of 904 hemoglobins and 150 myoglobins. The UniProtKB accession numbers for these proteins are provided in the supplementary online materials (see Tables 1–4 for accession numbers; see Data Set S1 for additional information).
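Steps 1–4 of the procedure, together with the type I normalization from step 7, could be sketched in Python roughly as follows. This is an illustrative sketch, not the supplementary C++ or Mathematica code. The function names, the toy sequence, and the choice to return `None` for the position measures of an absent residue are assumptions. The type II and type III theoretical variances are defined in figure 1A, which is not reproduced here, so the sketch stops at the raw measures for those two types; the type I variance π(1 − π)/N is the standard binomial result and is assumed to match the figure.

```python
import math
import statistics

# Codons per amino acid in the standard genetic code (61 coding codons in total).
CODONS = {"A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "E": 2, "Q": 2, "G": 4,
          "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
          "T": 4, "W": 1, "Y": 2, "V": 4}


def measures(sequence, residue):
    """Steps 1-4 for one residue: proportion, mean position, unbiased variance."""
    # Step 2: 1-based indices of every occurrence of the residue.
    positions = [i for i, aa in enumerate(sequence, start=1) if aa == residue]
    count, length = len(positions), len(sequence)
    proportion = count / length                                  # step 1, type I
    # Assumption: position measures are undefined (None) for absent residues.
    mean_pos = statistics.mean(positions) if count >= 1 else None      # step 3
    var_pos = statistics.variance(positions) if count >= 2 else None   # step 4
    return proportion, mean_pos, var_pos


def type1_parameter(sequence, residue, pi):
    """Step 7 for type I only: (n/N - pi) / sqrt(pi * (1 - pi) / N)."""
    length = len(sequence)
    proportion = sequence.count(residue) / length
    return (proportion - pi) / math.sqrt(pi * (1.0 - pi) / length)


def type2_theoretical_mean(length):
    """Average of all possible positions 1..N in a protein of length N."""
    return (length + 1) / 2


toy = "MGAGLKG"  # hypothetical 7-residue sequence, for illustration only
proportion, mean_pos, var_pos = measures(toy, "G")
# pi taken as codons / 64, following the choice stated in the text.
type1_vector = [type1_parameter(toy, aa, CODONS[aa] / 64) for aa in CODONS]
print(proportion, mean_pos, var_pos, len(type1_vector))
```

Because `measures` works on any string, the same sketch covers the DNA case by passing a nucleotide sequence and looping over the four bases instead of the twenty amino acids.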
Table 1

NCBI or SwissProt/UniProt Accession Numbers for Protein Kinase C dataset.

NP_001006133 P09215 P87253 Q5R4K9 Q7SY24 Q9UVJ5
NP_001008716 P09216 P90980 Q5TZD4 Q7SZH7 Q9Y792
NP_001012707 P09217 Q00078 Q62074 Q7SZH8 Q9Y7C1
O01715 P10102 Q02111 Q62101 Q7T2C5 XM_391874
O17874 P10829 Q02156 Q64617 Q86XJ6 XP_001066028
O19111 P10830 Q02956 Q69G16 Q86ZV2 XP_001116804
O42632 P13677 Q04759 Q6AZF7 Q873Y9 XP_001147999
O61224 P16054 Q05513 Q6BI27 Q8IUV5 XP_001250401
O61225 P17252 Q05655 Q6C292 Q8J213 XP_234108
O62567 P20444 Q15139 Q6DCJ8 Q8JFZ9 XP_421417
O62569 P23298 Q16974 Q6DUV1 Q8K1Y2 XP_540151
O76850 P24583 Q16975 Q6FJ43 Q8K2K8 XP_541432
O94806 P24723 Q19266 Q6GNZ7 Q8MXB6 XP_583587
O96942 P28867 Q25378 Q6P5Z2 Q8NE03 XP_602125
O96997 P34885 Q28EN9 Q6P748 Q90XF2 XP_683138
P04409 P36582 Q2NKI4 Q6UB96 Q91569 XP_849292
P05126 P36583 Q2U6A7 Q6UB97 Q91948 XP_851386
P05129 P41743 Q3UEA6 Q75BT0 Q99014 XP_851861
P05130 P43057 Q498G7 Q76G54 Q9BZL6
P05696 P63318 Q4AED5 Q7LZQ8 Q9GSZ3
P05771 P68403 Q4AED6 Q7LZQ9 Q9HF10
P05772 P68404 Q4R4U2 Q7QCP8 Q9HGK8
Table 2

SwissProt/UniProt Accession Numbers for Hemoglobin dataset – Part 1.

A1A4Q3 O24520 P01943 P01976 P02006 P02040 P02078 P02111 P02142 P07036 P08256
A1A4Q7 O24521 P01944 P01977 P02007 P02042 P02080 P02112 P02143 P07402 P08257
A1YZP4 O42425 P01945 P01978 P02008 P02044 P02081 P02113 P02208 P07403 P08258
A2TDC2 O76242 P01946 P01979 P02009 P02046 P02082 P02114 P02209 P07404 P08259
A2TDC3 O76243 P01947 P01980 P02010 P02047 P02083 P02115 P02213 P07405 P08260
A2V8C0 O77655 P01948 P01981 P02011 P02048 P02084 P02116 P04237 P07406 P08261
A2V8C1 O81941 P01949 P01982 P02012 P02049 P02085 P02117 P04238 P07407 P08422
A2V8C2 O88752 P01950 P01983 P02013 P02050 P02086 P02118 P04239 P07408 P08423
A4GTS5 O88754 P01951 P01984 P02014 P02051 P02087 P02120 P04240 P07409 P08535
A4GX73 O93348 P01953 P01985 P02015 P02052 P02088 P02121 P04241 P07410 P08849
A6YH87 O93349 P01954 P01986 P02016 P02053 P02089 P02122 P04242 P07411 P08850
A7UAU9 O93351 P01955 P01987 P02017 P02054 P02090 P02123 P04244 P07412 P08851
A9JSP7 O96457 P01956 P01988 P02018 P02055 P02091 P02124 P04245 P07413 P08852
A9XDF6 P01923 P01957 P01989 P02019 P02057 P02092 P02125 P04246 P07414 P08853
B0BL34 P01924 P01958 P01990 P02020 P02058 P02093 P02126 P04252 P07415 P09105
B0BL35 P01926 P01959 P01991 P02021 P02059 P02094 P02127 P04346 P07417 P09420
B1H216 P01928 P01960 P01992 P02022 P02060 P02095 P02128 P04442 P07419 P09421
B1Q450 P01929 P01961 P01993 P02024 P02061 P02097 P02129 P04443 P07421 P09422
B2BP38 P01930 P01962 P01994 P02025 P02062 P02099 P02130 P04444 P07425 P09423
B2ZUE0 P01932 P01963 P01995 P02026 P02064 P02100 P02131 P06148 P07428 P09839
O02004 P01933 P01964 P01996 P02028 P02065 P02101 P02132 P06467 P07429 P09840
O02480 P01934 P01965 P01997 P02029 P02066 P02102 P02133 P06635 P07430 P09904
O04985 P01935 P01966 P01998 P02030 P02067 P02103 P02134 P06636 P07431 P09905
O04986 P01936 P01967 P01999 P02031 P02070 P02104 P02135 P06637 P07432 P09906
O09232 P01937 P01968 P02000 P02032 P02072 P02105 P02136 P06638 P07433 P09907
O12985 P01938 P01969 P02001 P02033 P02073 P02106 P02137 P06639 P07803 P09908
O13077 P01939 P01971 P02002 P02035 P02074 P02107 P02138 P06642 P08054 P09909
O13078 P01940 P01972 P02003 P02036 P02075 P02108 P02139 P06643 P08223 P0C0U6
O13163 P01941 P01973 P02004 P02038 P02076 P02109 P02140 P07034 P08224 P0C0U7
O13164 P01942 P01975 P02005 P02039 P02077 P02110 P02141 P07035 P08225 P0C0U8
Table 3

SwissProt/UniProt Accession Numbers for Hemoglobin dataset – Part 2.

P0C237 P14524 P20018 P41327 P67822 P68194 P83132 Q27940 Q6AW44 Q8AXX7 Q9TSN9
P0C238 P14525 P20019 P41328 P67823 P68222 P83133 Q28220 Q6AW45 Q8AYL9 Q9TVA3
P0C239 P14526 P20243 P41329 P67824 P68223 P83134 Q28221 Q6B0K9 Q8AYM0 Q9U6L2
P0C240 P14527 P20244 P41330 P68011 P68224 P83135 Q28338 Q6BBJ0 Q8AYM1 Q9U6L6
P10057 P15162 P20245 P41331 P68012 P68225 P83270 Q28356 Q6BBJ2 Q8BPF4 Q9XSE9
P10058 P15163 P20246 P41332 P68013 P68226 P83271 Q28496 Q6BBK1 Q8BYM1 Q9XSK1
P10059 P15164 P20247 P45718 P68014 P68227 P83272 Q28507 Q6H1U7 Q8GV40 Q9XSN2
P10060 P15165 P20854 P45719 P68015 P68228 P83273 Q28775 Q6IBF6 Q8GV41 Q9XSN3
P10061 P15166 P20855 P45720 P68016 P68229 P83478 Q28779 Q6LDH0 Q8GV42 Q9XTL1
P10062 P15448 P21197 P45721 P68017 P68230 P83479 Q28931 Q6LDH1 Q8HY34 Q9XTM2
P10777 P15449 P21198 P45722 P68018 P68231 P83611 Q28932 Q6QDC2 Q8QG65 Q9Y0D5
P10778 P15469 P21199 P51438 P68019 P68232 P83612 Q29415 Q6R7N2 Q90485 Q9Y0E6
P10780 P16309 P21200 P51440 P68020 P68234 P83613 Q2KPA3 Q6S9E2 Q90486 Q9YGW1
P10781 P16417 P21201 P51441 P68021 P68235 P83614 Q2KPA4 Q6T497 Q90487 Q9YGW2
P10782 P16418 P21379 P51442 P68022 P68236 P83623 Q2MHE0 Q6WN20 Q90ZM4
P10783 P17689 P21380 P51443 P68023 P68237 P83624 Q2PAD4 Q6WN21 Q91473
P10784 P18435 P21667 P51465 P68024 P68238 P83625 Q3C1F3 Q6WN22 Q941P9
P10785 P18436 P21668 P55267 P68025 P68239 P84203 Q3C1F4 Q6WN25 Q941Q1
P10786 P18707 P21766 P56250 P68026 P68240 P84204 Q3MQ26 Q6WN26 Q946U7
P10883 P18969 P21767 P56251 P68027 P68256 P84205 Q3S3T0 Q6WN27 Q947C5
P10885 P18970 P21768 P56285 P68028 P68257 P84206 Q3TUN7 Q6WN28 Q94FG6
P10892 P18971 P21871 P56691 P68029 P68258 P84216 Q3U0A6 Q6WN29 Q94FT7
P10893 P18972 P22740 P56692 P68030 P68871 P84217 Q3Y9L5 Q6Y239 Q94FT8
P11025 P18973 P22741 P60523 P68031 P68872 P84479 Q3Y9L6 Q6Y257 Q95190
P11251 P18974 P22742 P60524 P68044 P68873 P84604 Q42665 Q760P9 Q95238
P11342 P18975 P22743 P60525 P68045 P68944 P84609 Q42831 Q760Q0 Q95NK8
P11517 P18976 P23016 P60526 P68046 P68945 P84610 Q43306 Q760Q1 Q966U3
P11748 P18977 P23017 P60529 P68047 P69891 P84611 Q45V69 Q760Q2 Q96FH6
P11749 P18978 P23018 P60530 P68048 P69892 P84652 Q45V70 Q78PA4 Q98905
P11750 P18981 P23019 P61772 P68049 P69905 P84653 Q45XH3 Q7JFN6 Q98TS0
P11751 P18982 P23020 P61773 P68050 P69906 P84790 Q45XH5 Q7JFR7 Q9AWA9
P11752 P18983 P23600 P61774 P68051 P69907 P84791 Q45XH6 Q7LZB9 Q9CWS5
P11753 P18984 P23601 P61775 P68052 P80043 P84792 Q45XH7 Q7LZC1 Q9CY06
P11754 P18985 P23602 P61920 P68053 P80044 P85081 Q45XH8 Q7LZC2 Q9CY10
P11755 P18986 P23740 P61921 P68054 P80216 P85082 Q45XI4 Q7LZC3 Q9CZK5
P11756 P18987 P23741 P61947 P68055 P80270 Q03902 Q45XI5 Q7LZL6 Q9D0B2
P11757 P18988 P24291 P61948 P68056 P80271 Q0PB48 Q45XI6 Q7LZM6 Q9DF25
P11758 P18989 P24292 P62363 P68057 P80726 Q0PG38 Q45XI7 Q7LZM7 Q9FUD6
P11896 P18990 P24589 P62387 P68058 P80727 Q0WSU5 Q45XI8 Q7LZM8 Q9FVL0
P13273 P18993 P24659 P62741 P68059 P80945 Q0ZA50 Q45XI9 Q7M2Y4 Q9FY42
P13274 P18994 P24660 P62742 P68060 P80946 Q10732 Q45XJ0 Q7M2Y5 Q9GJS7
P13557 P18995 P26915 P63105 P68061 P81023 Q10733 Q4F6Z2 Q7M3B6 Q9GLX4
P13558 P18996 P26916 P63106 P68062 P81024 Q17153 Q4VIX3 Q7M3B8 Q9I9I3
P14259 P19002 P28780 P63107 P68063 P81042 Q17155 Q53I65 Q7M3C2 Q9M3U9
P14260 P19014 P28781 P63108 P68064 P81043 Q17156 Q549D9 Q7M413 Q9M593
P14261 P19015 P29623 P63109 P68065 P82111 Q17157 Q549G1 Q7M418 Q9M630
P14387 P19016 P29624 P63110 P68068 P82112 Q17286 Q58L97 Q7M419 Q9PRL9
P14388 P19645 P29625 P63111 P68069 P82113 Q1AGS4 Q5BLF6 Q7M421 Q9PVM1
P14389 P19646 P29626 P63112 P68070 P82315 Q1AGS5 Q5GLZ6 Q7M422 Q9PVM2
P14390 P19759 P29628 P67815 P68071 P82316 Q1AGS6 Q5GLZ7 Q7T1B0 Q9PVM3
P14391 P19760 P30892 P67816 P68077 P82345 Q1AGS7 Q5I122 Q7Y079 Q9PVM4
P14392 P19789 P30893 P67817 P68078 P82990 Q1AGS8 Q5KSB7 Q7Y1Y1 Q9PVU6
P14520 P19831 P33499 P67818 P68079 P83114 Q1AGS9 Q5MD69 Q7ZT21 Q9TS34
P14521 P19832 P41260 P67819 P68087 P83123 Q1W6G9 Q5RM02 Q803Z5 Q9TS35
P14522 P19885 P41261 P67820 P68168 P83124 Q26505 Q5XLE5 Q862A7 Q9TSN7
P14523 P19886 P41262 P67821 P68169 P83131 Q27126 Q67XG0 Q86G74 Q9TSN8
Table 4

SwissProt/UniProt Accession Numbers for Myoglobin dataset.

O77003 P02159 P02174 P02190 P02205 P14393 P51535 P68086 Q01966 Q701N9 Q9DEP1
P02144 P02160 P02177 P02191 P02206 P14396 P51537 P68189 Q03459 Q76G09 Q9DGI8
P02145 P02161 P02178 P02192 P02210 P14397 P56208 P68190 Q0KIY0 Q7LZM1 Q9DGI9
P02147 P02163 P02179 P02193 P02211 P14398 P62734 P68276 Q0KIY1 Q7LZM2 Q9DGJ0
P02148 P02164 P02180 P02194 P02214 P14399 P62735 P68277 Q0KIY2 Q7LZM3 Q9DGJ1
P02150 P02165 P02181 P02196 P02215 P15160 P63113 P68278 Q0KIY3 Q7LZM4 Q9DGJ2
P02151 P02166 P02182 P02197 P04247 P17724 P63114 P68279 Q0KIY5 Q7LZM5 Q9QZ76
P02152 P02167 P02183 P02199 P04248 P20856 P68080 P80721 Q0KIY7 Q7M416
P02153 P02168 P02184 P02200 P04249 P29287 P68081 P80722 Q0KIY9 Q7M424
P02154 P02169 P02185 P02201 P04250 P30562 P68082 P83682 Q2MJN4 Q7T044
P02155 P02170 P02186 P02202 P09965 P31331 P68083 P84997 Q6I7B0 Q9DEN8
P02156 P02171 P02187 P02203 P0C227 P32428 P68084 P85077 Q6PL31 Q9DEN9
P02157 P02173 P02189 P02204 P11343 P49672 P68085 P87497 Q6VN46 Q9DEP0

Visualization of the Feature Vectors

The feature vector is high dimensional. PCA is a popular method of reducing the number of variables in a vector. This method provides a linear transformation of the variables of the feature vector into a new set of uncorrelated variables. In so doing, it captures the dominant variations in the data set. The first two principal components contain more information than any other pair of linearly constructed variables and thus are used in our analysis [6]. This allows us to easily visualize the key elements of the data without loss of much information. An alternative method for visualizing the patterns emerging from the feature vector method of analysis is to create a dendrogram using agglomerative hierarchical clustering. We use agglomerative hierarchical clustering with complete linkage [7] to provide a more detailed view of the data, and of the relationships between groups within the data. In order to demonstrate the utility and flexibility of the feature vector method, we provide one dendrogram as part of the PKC analysis.
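Both visualization routes described above can be illustrated with a dependency-free Python sketch. It is not the Matlab code from Sample Code S3. The PCA here uses power iteration with deflation on the covariance matrix, a choice made only to avoid external libraries, and the clustering is a naive complete-linkage implementation on Euclidean distances. All function names and the toy three-dimensional vectors are assumptions for illustration; real protein feature vectors have 60 entries.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract each column mean so every feature averages zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    cols = list(zip(*centered))
    return [[dot(ci, cj) / (len(centered) - 1) for cj in cols] for ci in cols]


def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    start = [float(k + 1) for k in range(len(matrix))]  # arbitrary start vector
    norm = math.sqrt(dot(start, start))
    vec = [x / norm for x in start]
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return dot(vec, [dot(row, vec) for row in matrix]), vec


def first_two_components(rows):
    """Project rows onto the two directions of largest variance."""
    centered = center(rows)
    cov = covariance(centered)
    val1, vec1 = leading_eigenpair(cov)
    # Deflation: subtract the first component, then repeat to find the second.
    size = len(cov)
    deflated = [[cov[i][j] - val1 * vec1[i] * vec1[j] for j in range(size)]
                for i in range(size)]
    val2, vec2 = leading_eigenpair(deflated)
    scores = [(dot(row, vec1), dot(row, vec2)) for row in centered]
    return scores, (val1, val2)


def complete_linkage(points):
    """Merge history [(cluster_a, cluster_b, distance)] under complete linkage."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                dist = max(math.dist(points[i], points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters = rest + [clusters[a] + clusters[b]]
    return merges


# Hypothetical 3-dimensional "feature vectors" forming two obvious groups.
vectors = [[0.0, 0.1, 1.0], [0.2, 0.0, 1.1], [3.0, 3.1, -1.0], [3.2, 2.9, -0.9]]
scores, variances = first_two_components(vectors)
merges = complete_linkage(vectors)
print(scores, variances, merges)
```

Plotting the two columns of `scores` gives the kind of scatter plot used in the figures, and the merge history is the information a dendrogram draws. For datasets of realistic size, a library implementation of PCA and hierarchical clustering would be the more practical choice.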

Identification of Protein Sub-Types

Kinases are proteins which modify other proteins by phosphorylation, the covalent addition of phosphate groups [8]. The PKC family is a large multigene family of serine/threonine kinases [8]. Six main groups of PKCs can be identified by domain architecture: conventional, novel, atypical, PKCμ-like, fungal PKC1, and PKC-related kinases [9]. The first three of these groups can be further categorized into subtypes [9]. In general, the PKC domain architecture (fig. 2A) consists of a regulatory region and a catalytic domain [10]. The regulatory region contains several functional domains of varying types [10]. True PKCs are classified as conventional, novel or atypical based on the functional domains present in the regulatory regions of the PKCs (fig. 2A). Briefly, conventional PKCs contain subtypes α, βI, βII and γ; novel PKCs contain subtypes θ, ε, δ and η; and atypical PKCs contain λ/ι and ζ [9]. The catalytic domain is more conserved and more commonly used for differentiating between families of protein kinases [11]. However, it is also useful in characterizing the PKCμ-like kinases, which contain markedly different catalytic regions from the rest of the protein kinase C members [12].
Figure 2

Analysis of the Protein Kinase C (PKC) family using the feature vector method.

(A) Structural architecture types of PKCs used in analysis, showing regulatory domains – C1, C2, PB1, HR1 and PH – and catalytic domains. cPKC – conventional PKCs, aPKC – atypical PKCs, nPKC – novel PKCs, PKC1 – fungal PKC1s, PRK – PKC-related kinases, PKCμ – PKCμ-like PKCs; (B) agglomerative hierarchical clustering dendrogram of PKC feature vectors for structural architecture types; (C) principal component analysis (PCA) of PKC feature vectors, coded by known architectural type of proteins – red dots: cPKCs; blue dots: aPKCs; green dots: nPKCs; black dots: PKC1s; orange dots: PKCμ; yellow dots: PRKs; dashed lines indicate dividing surfaces for identifying major clusters in the data set.

The dendrogram created by agglomerative hierarchical clustering of the PKC feature vectors (fig. 2B) successfully recreates the phylogenetic relationships between PKC architectural types and highlights the degree of difference between PKC-related kinases, PKC1s and other PKCs. By running a PCA on the feature vector values for each PKC protein, we were able to quickly visualize the six architecture types of PKCs in our dataset. Figure 2C shows the PCA output for PKCs. A dashed line marking a clear dividing surface is added to this figure to demonstrate divisions in the data that warrant further analysis. Fungal PKC1s are clearly separated from other PKCs and can be identified as an important, distinct grouping (fig. 2C). In addition, PKC-related kinases and true PKCs are located in distinct clusters (fig. 2C). Finally, conventional PKCs and novel PKCs are resolved into distinct clusters (fig. 2C).

A similar identification of hemoglobin structure was also possible (fig. 3). Hemoglobins are large proteins which function in oxygen transport [13]. Each hemoglobin molecule contains 2 alpha chains and 2 beta chains, subunits which are identifiable by structural characteristics [13]. Alpha hemoglobins lack a specific alpha-helix, the D helix, that is present in beta hemoglobins [14]. Many organisms have several distinct hemoglobins, an adult form and embryonic or fetal forms created by combining different alpha and beta hemoglobin units [13]. There are more types of embryonic and fetal alpha hemoglobins than beta hemoglobins, and thus alpha hemoglobins are presumed evolutionarily older [13]. In protein databases, alpha and beta chain hemoglobins belonging to a given species are typically recorded in separate entries and so are separate in our data set also.
Figure 3

Analysis of hemoglobin proteins using the feature vector method.

PCA results of hemoglobin feature vectors, coded by known protein type – red dots: beta-chain hemoglobins; blue dots: alpha-chain hemoglobins, including fetal alpha-type proteins; green dots: leghaemoglobins; grey dots: other hemoglobins.

Using the feature vector method, the PCA identifies the difference between alpha chain hemoglobins, beta chain hemoglobins and leghaemoglobins, a type of monomeric hemoglobin chain found in plants [15], as the main features of the protein set (fig. 3). A range of other types of hemoglobins occur in the dataset in small numbers, but are not well resolved into distinct clusters due to their rarity in the data (see below). These proteins include the non-symbiotic plant hemoglobins, also called truncated or 2-on-2 hemoglobins [16]; the lamprey/hagfish hemoglobins [17]; bacterial hemoglobins; and erythrocruorins, which are large extracellular hemoglobins found in annelid worms and arthropods [18].

The feature vector method is able to store large amounts of information about the proteins in the dataset. When the 150 myoglobins are added to the hemoglobin dataset, the feature vector is able to distinguish these two types of proteins (fig. 4A), while retaining information about structural relationships within the hemoglobin family (fig. 4B). Myoglobins are single chain hemoproteins and share a common ancestor with hemoglobins, more than 500 million years ago [19]. Structurally, myoglobins are similar to leghaemoglobins, but functionally these proteins are quite different, with leghaemoglobins having significantly higher oxygen affinity and a broad range of functions within plant nodules [15], [19], [20]. The feature vector method, combined with PCA for visualization, clearly separates the majority of myoglobins from hemoglobins (fig. 4A), while preserving the ability to identify structural relationships between alpha and beta hemoglobins and leghaemoglobins (fig. 4B).
Figure 4

Identification of protein type as myoglobin and hemoglobin by feature vector analysis.

(A) PCA results of hemoglobin and myoglobin feature vectors, coded by known protein family – red dots: all hemoglobins; black triangles: myoglobins; (B) PCA results of hemoglobin and myoglobin feature vectors, coded by known protein type with hemoglobins identified by subtype – red dots: beta-chain hemoglobins; blue dots: alpha-chain hemoglobins, including fetal alpha-type proteins; green dots: leghaemoglobins; grey dots: other hemoglobins; black triangles: myoglobins.

When analyzing a large and varied protein dataset, some protein types may occur infrequently. These rare protein types are more difficult to cluster using PCA due to the limited amount of information available to the algorithm. As a result, these rare proteins cluster near the center of the PCA output, creating ‘noise’ in the analysis (fig. 3). By limiting the hemoglobin dataset to only adult alpha and beta chain mammalian hemoglobins and mammalian myoglobins, the most frequent protein types present in the dataset, the ability of the feature vector method to create clear separation between different groups of proteins is readily apparent (fig. 5). As in previous figures, dashed lines indicating decision surfaces are used to highlight clusters warranting further analysis.
Figure 5

Use of the feature vector method allows unequivocal protein identification when data is limited to large, well-defined protein types.

PCA results of adult alpha and adult beta mammalian hemoglobins and mammalian myoglobin feature vectors, coded by known protein type to demonstrate the ability of the feature vector to produce perfect separation of types – dashed lines indicate dividing surfaces for identifying clusters in the data; red dots: beta-chain mammalian hemoglobins; blue dots: alpha-chain mammalian hemoglobins, excluding fetal proteins; black triangles: mammalian myoglobins.


Discussion

The feature vector method described here is intended to measure the distance between protein sequences in a way that makes numerical comparisons easy and allows identification of similarity within large numbers of proteins that are not too distantly related. Using PKCs and hemoproteins as examples, we demonstrated the effectiveness of this method. When groups are completely distinct, perfect separation can be achieved; where there are gradual changes in the sequences of proteins, the feature vector performs well in conjunction with principal components analysis. Importantly, this method does not attempt to characterize differences as functional or non-functional, nor does it seek to identify key single point mutations. Rather, the goal is to provide a rapid understanding of the patterns of relatedness in large datasets of protein sequences.

Although protein kinase C was one of the first protein kinases discovered [21], categorizing members of this family is particularly challenging [22], [23]. Our method successfully reproduces the traditional classification of PKCs and clusters family members on the basis of these classifications [24]. Previous work analyzing relationships among multiple PKCs or among the larger kinase superfamily has been limited by the maximum dataset size [23], [24], [25], in a way that our method is not. The statistical feature vector method is particularly useful as a simple way of identifying subgroups in non-mammalian PKCs, an area where little is known. In the future, more detailed visualization techniques may suggest new relationships which could be explored experimentally.

The classification of mammalian hemoglobins is also well understood [16]. However, research is increasingly identifying hemoglobin-like proteins outside of mammals, including bacterial hemoglobins and non-symbiotic hemoglobins in plants [16]. As the number of hemoglobins identified in these organisms increases, the feature vector method will provide a simple tool for identifying structural groupings within these proteins.

The feature vector method provides one of the most definitive ways of classifying various types of proteins. This method provides an advantage over other classification programs in ease of use and, unlike other methods, the feature vector is not constrained to a single protein family or superfamily [23]. We have shown the usefulness of this method in PKCs and hemoproteins, and we anticipate that it will perform equally well when applied to other protein families, providing a simple, rapid tool for sorting through the increasingly large datasets of proteins now available to researchers. In the future, the utility of this method can be increased by applying new, and more specific, visualization tools to the analysis of the feature vector output, such as K-means, agglomerative hierarchical clustering, artificial neural networks and self-organizing maps. For a given data set, the patterns of variation in sequences can be learned by neural networks, or other methods, to provide a more accurate classification or clustering than can be achieved with less flexible methods like principal components analysis.
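As a rough illustration of how agglomerative hierarchical clustering could be applied to feature vector output, the following Python sketch merges the two closest clusters repeatedly until a requested number of clusters remains. It is a minimal sketch under stated assumptions, not the code used in this study: the average-linkage criterion, the Euclidean distance, the function names `average_linkage` and `agglomerate`, and the toy three-dimensional vectors are all choices made for the example, and the vectors are made-up numbers rather than real feature vectors.

```python
from itertools import combinations
from math import dist


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean Euclidean distance over all cross-cluster pairs of vectors."""
    total = sum(dist(vectors[i], vectors[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a sorted list of clusters; each cluster is a sorted list of
    row indices into `vectors`.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # Start with every vector in its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average-linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], vectors
            ),
        )
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical feature vectors (illustrative values only, not paper data).
toy_vectors = [
    (0.1, 0.2, 0.0),
    (0.0, 0.3, 0.1),
    (5.0, 5.1, 4.9),
    (5.2, 4.8, 5.0),
    (9.9, 0.1, 10.0),
]
toy_clusters = agglomerate(toy_vectors, 3)
print(toy_clusters)  # [[0, 1], [2, 3], [4]]
```

Recording the order of merges, rather than stopping at a fixed cluster count, would give the tree structure needed for a cladogram. This naive version recomputes all pairwise linkages at every step, so a dataset of several hundred proteins would call for a library implementation instead.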

Methods

Datasets

We used three online protein sequence databases to create our protein datasets: UniProtKB, UniProtKB/SwissProt, and NCBI Entrez-Protein. UniProtKB (www.uniprot.org) is an online repository of protein sequences; UniProtKB/SwissProt (http://ca.expasy.org/sprot/) builds upon this repository through annotation of protein sequences. Information available in UniProtKB/SwissProt includes citations for related publications, species name, protein family, domain structure and detail on protein variants and structure. NCBI Entrez-Protein (http://www.ncbi.nlm.nih.gov/protein/) is an online protein sequence database curated by the National Center for Biotechnology Information (NCBI). The protein kinase C dataset of 127 protein sequences was downloaded from the NCBI Entrez-Protein and UniProtKB/SwissProt databases. The hemoglobin and myoglobin datasets, of 904 and 150 protein sequences respectively, were downloaded from the UniProtKB database. To ensure that sequences were not fragments and were not labeled with the wrong protein family, sequences were analyzed using the SMART domain recognition software on the UniProtKB website. In addition, for all sequences the family classification was confirmed and the subfamily classification was assigned based on peer-reviewed journal articles obtained through the SwissProt database reference listings, or on notations in the UniProtKB entries where detailed information from articles was not available.

Sample Parameter Calculations. This file works through the calculations of the parameters for Alanine in a short, hypothetical protein, and demonstrates the construction of the feature vector for this protein. (0.05 MB PDF)

Feature vectors computation in C++. Computation of the feature vectors for a protein data set in C++. (0.01 MB TXT)

Feature vector computation in Mathematica. Computation of the feature vector for a single protein in Mathematica. (0.00 MB TXT)

PCA code for Matlab. Principal component analysis code for Matlab. (0.00 MB TXT)

Protein Kinase C Feature Vectors. This file contains the set of all feature vectors for the PKC proteins in our dataset. (0.16 MB XLS)

Hemoglobin Feature Vectors. This file provides the set of all feature vectors for the Hemoglobins in our dataset. (1.03 MB XLS)

Myoglobin Feature Vectors. This file provides the set of all feature vectors for the Myoglobins in our dataset. (0.17 MB XLS)

Protein datasets. Accession numbers and taxonomic information for the Protein Kinase C (PKC), Hemoglobin and Myoglobin datasets. Each protein dataset is provided as a separate worksheet. (0.16 MB XLS)
  22 in total

Review 1.  Plants, humans and hemoglobins.

Authors:  Suman Kundu; James T Trent; Mark S Hargrove
Journal:  Trends Plant Sci       Date:  2003-08       Impact factor: 18.313

Review 2.  The protein kinase family: conserved features and deduced phylogeny of the catalytic domains.

Authors:  S K Hanks; A M Quinn; T Hunter
Journal:  Science       Date:  1988-07-01       Impact factor: 47.728

3.  Multiple, distinct forms of bovine and human protein kinase C suggest diversity in cellular signaling pathways.

Authors:  L Coussens; P J Parker; L Rhee; T L Yang-Feng; E Chen; M D Waterfield; U Francke; A Ullrich
Journal:  Science       Date:  1986-08-22       Impact factor: 47.728

4.  Lamprey hemoglobin. Structural basis of the bohr effect.

Authors:  Y Qiu; D H Maillett; J Knapp; J S Olson; A F Riggs
Journal:  J Biol Chem       Date:  2000-05-05       Impact factor: 5.157

Review 5.  Myoglobin: an essential hemoprotein in striated muscle.

Authors:  George A Ordway; Daniel J Garry
Journal:  J Exp Biol       Date:  2004-09       Impact factor: 3.312

6.  Complete amino acid sequences of the major early embryonic alpha-like globins of the chicken.

Authors:  B S Chapman; A J Tobin; L E Hood
Journal:  J Biol Chem       Date:  1980-10-10       Impact factor: 5.157

Review 7.  Protein kinase D: a family affair.

Authors:  An Rykx; Line De Kimpe; Svetlana Mikhalap; Tibor Vantus; Thomas Seufferlein; Jackie R Vandenheede; Johan Van Lint
Journal:  FEBS Lett       Date:  2003-07-03       Impact factor: 4.124

Review 8.  Protein kinase C isozymes as potential targets for anticancer therapy.

Authors:  Johann Hofmann
Journal:  Curr Cancer Drug Targets       Date:  2004-03       Impact factor: 3.428

9.  The relationship of protein conservation and sequence length.

Authors:  David J Lipman; Alexander Souvorov; Eugene V Koonin; Anna R Panchenko; Tatiana A Tatusova
Journal:  BMC Evol Biol       Date:  2002-11-01       Impact factor: 3.260

10.  Kinomer v. 1.0: a database of systematically classified eukaryotic protein kinases.

Authors:  David M A Martin; Diego Miranda-Saavedra; Geoffrey J Barton
Journal:  Nucleic Acids Res       Date:  2008-10-30       Impact factor: 16.971

