Literature DB >> 20221427

A rapid method for characterization of protein relatedness using feature vectors.

Kareem Carr¹, Eleanor Murray, Ebenezer Armah, Rong L He, Stephen S-T Yau.

Abstract

We propose a feature vector approach to characterize the variation in large data sets of biological sequences. Each candidate sequence produces a single feature vector constructed with the number and location of amino acids or nucleic acids in the sequence. The feature vector characterizes the distance between the actual sequence and a model of a theoretical sequence based on the binomial and uniform distributions. This method is distinctive in that it does not rely on sequence alignment for determining protein relatedness, allowing the user to visualize the relationships within a set of proteins without making a priori assumptions about those proteins. We apply our method to two large families of proteins: protein kinase C, and globins, including hemoglobins and myoglobins. We interpret the high-dimensional feature vectors using principal components analysis and agglomerative hierarchical clustering. We find that the feature vector retains much of the information about the original sequence. By using principal component analysis to extract information from collections of feature vectors, we are able to quickly identify the nature of variation in a collection of proteins. Where collections are phylogenetically or functionally related, this is easily detected. Hierarchical agglomerative clustering provides a means of constructing cladograms from the feature vector output.

Entities: CellLine Chemical Gene Species

Mesh：

Substances：

Year: 2010 PMID： 20221427 PMCID： PMC2832692 DOI： 10.1371/journal.pone.0009550

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Recent advances in biotechnology have allowed sequencing of millions of proteins from a wide spectrum of organisms and this information is rapidly becoming accessible to any researcher with an internet connection. For instance, the UniProtKB database currently contains over 7 million protein sequences and is updated every three weeks [1]. Current protein alignment methods are often slow and require assumptions about relatedness and evolutionary mechanisms [2]. In order to make use of the vast amount of protein data available, a method for quickly delineating large numbers of proteins into related types is necessary. As a solution, we propose a method for quantifying the frequency and position of amino acids within a protein, and demonstrate the ease, rapidity and usefulness of this technique for uncovering phylogenetic and functional relationships within protein families, using protein kinase C (PKC), hemoglobin and myoglobin as examples. One of us (S.Y.) previously designed three types of parameters for use in clustering amino acid sequences [3]. Here, we reconceptualize these parameters, making them more statistical in character and more applicable to measuring protein similarity. The new parameters measure the degree to which the distribution of amino acids in a particular protein deviates from a theoretical protein containing an equal number of residues but undergoing neutral evolution. This measure allows us not only to characterize the extent to which the distribution of amino acids in a protein deviates from the expected but also the distance between proteins. Our theoretical model of the distribution of amino acids in a protein assumes that for any given amino acid, the locations of residues of that amino acid are distributed uniformly and the number of residues is distributed binomially. In contrast to previous methods of constructing distance measures to determine protein relatedness [4], our method does not require performing multiple sequence alignment. Instead, this method creates measures of protein relatedness based on the distribution of amino acids in the proteins. Differences in these measures are then used to determine the difference between two proteins. In doing so, this method requires no assumptions about the way in which certain amino acids may be inserted or deleted. This allows us to look at protein difference in an abstract way, without making assumptions about the mechanisms by which these differences may arise.

Results

Implementation of the Feature Vectors

The selection pressure to which proteins are subjected affects both the number of residues in the protein [5] and the identity and location of these residues. By using a feature vector, which is an ordered list of numbers that characterize the distribution of amino acids in a protein, we can describe this selection pressure. Our method uses a feature vector constructed with three types of parameters (Fig. 1). The three types of parameters can be described as compositional, centrodial and distributional. Compositional parameters (type I) measure the extent to which the proportion of amino acids deviate from the expected. Centroidial parameters (type II) measure the extent to which amino acids tend to be in a particular region of the protein. Distributional parameters (type III) measure the extent to which amino acids cluster along the length of the protein. All parameters are adjusted for the length of the protein. From simulation of parameter distributions, it can be seen that, for some range of protein lengths and frequencies of amino acids, individual parameters have an approximately Gaussian distribution over large data sets (data not shown). The distance between feature vectors is measured as Euclidean distance.

Figure 1

Calculation of the Protein Feature Vectors.

Calculation of the Protein Feature Vectors.

(A) Intermediate calculations necessary to compute three component parameters of the feature vector, parameters are computed as (measure − mean)/sqrt(variance); (B) sample parameters of Types I–III for an amino acid of type X, the full 60-dimentional feature vector for a given protein will include parameters for all amino acid types in that protein (3 parameters for each of 20 amino acid types). To describe any protein using the feature vector method, we must first compute parameters of type I, II and III for each of the twenty amino acids which occur in proteins (Supplement S1). The parameter for each amino acid can be computed based on the information in figure 1A, by subtracting the theoretical mean from the measure and dividing by the square root of the theoretical variance. As an example, the formula for a parameter of type I for glycine (G) is: Here, π is the probability of glycine occurring in a theoretical dataset of proteins under neutral evolution. This probability is set by the user and can vary for a given dataset. For the current analyses, we use by the number of genetic codons for glycine divided by 64. Alternatively, one can set π to the frequency of glycine in the total dataset of proteins. The variable n is the number of glycines in the protein and N is the length of the protein. The following 9 steps detail the specifics of computing the parameters for the feature vector method and of using the feature vectors to categorize and analyze proteins. C++ and Mathematica (Wolfram Research, Urbana-Champaign, IL) code for steps 1–8 is available in the supplementary online materials (Sample Code S1 and S2). Step 9, performing a principle component analysis (PCA), can be done using Matlab software (Mathworks, Natwich, MA) or similar statistical software. Matlab code is provided in the supplementary online materials (Sample Code S3). 1. Count the number of amino acids of each type and note the length of the protein. We denote the length of protein as N and the number of amino acids of each type as nA, nR, nN, nD, nC, nE, nQ, nG, nH, nI, nL, nK, nM, nF, nP, nS, nT, nW, nY and nV. Use these numbers to compute the proportion of amino acids in the protein, the type I measure (fig. 1A): pA, pR, pN, pD, pC, pE, pQ, pG, pH, pI, pL, pK, pM, pF, pP, pS, pT, pW, pY and pV. (Where, A = Alanine, R = Arginine, N = Asparagine, D = Aspartic acid, C = Cysteine, E = Glutamic, Q = Glutamine, G = Glycine, H = Histidine, I = Isoleucine, L = Leucine, K = Lysine, M = Methionine, F = Phenylalanine, P = Proline, S = Serine, T = Threonine, W = Tryptophan, Y = Tyrosine, V = Valine). 2. For each amino acid, find the indices of the positions in the protein sequence which contain that amino acid (we count the first position as one and not zero). 3. Use the indices from step 2 to compute the mean position of each amino acid, the type II measure: mA, mR, mN, mD, mC, mE, mQ, mG, mH, mI, mL, mK, mM, mF, mP, mS, mT, mW, mY and mV. The type II theoretical mean is calculated as the average of all possible positions in a protein of length N. 4. Using the indices from step 2 compute the unbiased variance, type III measure, of the set of indices of each amino acid: vA, vR, vN, vD, vC, vE, vQ, vG, vH, vI, vL, vK, vM, vF, vP, vS, vT, vW, vY and vV. The type III theoretical mean is calculated as half of one less than the square of the length. 5. The type I theoretical mean is the a priori estimate of the probability of the occurrence of each amino acid: πA, πR, πN, πD, πC, πE, πQ, πG, πH, πI, πL, πK, πM, πF, πP, πS, πT, πW, πY and πV, This value was taken to be the number of codons corresponding to a particular amino acid divided by the total number of coding codons, although it could also more appropriately be taken as the rate of occurrence of amino acids in the population from which the sample proteins were selected. Use the type I mean to compute the type I theoretical variance of index positions given the observed number of amino acids. 6. Compute the type II theoretical variance in the mean and the type III theoretical variance of the variance given the observed number of amino acids as shown in figure 1A. 7. Normalize the measures computed in steps 1, 3 & 4 by subtracting the theoretical means, computed in steps 3, 4 & 5, and dividing by the square root of the theoretical variances, computed in steps 5 & 6. 8. Assemble the parameters into a 60 dimensional vector. 9. Using principle components analysis (PCA) or some other means of high dimensional data analysis, search for clusters or patterns in the protein data. This method can be adapted for use in analyzing DNA or RNA sequences. To create a feature vector for a given DNA sequence, one need only create parameters of types I, II and III for each nucleic acid type, and then combine these parameters into a feature vector. The resulting DNA feature vector will be 12-dimensional, with 3 types of parameter for each of 4 types of nucleic acid, compared to the 60-dimensional protein feature vector (Tables S1, S2, and S3 contain feature vectors for our PKC, hemoglobin and myoglobin datasets, respectively). To demonstrate the utility of this method, we applied our feature vector method to an exhaustive set of 128 proteins from the PKC family and a set of 904 hemoglobins and 150 myoglobins. The UniProt KB accession numbers for these proteins are provided in the supplementary online materials (see Tables 1– 4 for accession numbers; see Data Set S1 for additional information).

Table 1

NCBI or SwissProt/UniProt Accession Numbers for Protein Kinase C dataset.

NP_001006133	P09215	P87253	Q5R4K9	Q7SY24	Q9UVJ5
NP_001008716	P09216	P90980	Q5TZD4	Q7SZH7	Q9Y792
NP_001012707	P09217	Q00078	Q62074	Q7SZH8	Q9Y7C1
O01715	P10102	Q02111	Q62101	Q7T2C5	XM_391874
O17874	P10829	Q02156	Q64617	Q86XJ6	XP_001066028
O19111	P10830	Q02956	Q69G16	Q86ZV2	XP_001116804
O42632	P13677	Q04759	Q6AZF7	Q873Y9	XP_001147999
O61224	P16054	Q05513	Q6BI27	Q8IUV5	XP_001250401
O61225	P17252	Q05655	Q6C292	Q8J213	XP_234108
O62567	P20444	Q15139	Q6DCJ8	Q8JFZ9	XP_421417
O62569	P23298	Q16974	Q6DUV1	Q8K1Y2	XP_540151
O76850	P24583	Q16975	Q6FJ43	Q8K2K8	XP_541432
O94806	P24723	Q19266	Q6GNZ7	Q8MXB6	XP_583587
O96942	P28867	Q25378	Q6P5Z2	Q8NE03	XP_602125
O96997	P34885	Q28EN9	Q6P748	Q90XF2	XP_683138
P04409	P36582	Q2NKI4	Q6UB96	Q91569	XP_849292
P05126	P36583	Q2U6A7	Q6UB97	Q91948	XP_851386
P05129	P41743	Q3UEA6	Q75BT0	Q99014	XP_851861
P05130	P43057	Q498G7	Q76G54	Q9BZL6
P05696	P63318	Q4AED5	Q7LZQ8	Q9GSZ3
P05771	P68403	Q4AED6	Q7LZQ9	Q9HF10
P05772	P68404	Q4R4U2	Q7QCP8	Q9HGK8

Table 2

SwissProt/UniProt Accession Numbers for Hemoglobin dataset – Part 1.

A1A4Q3	O24520	P01943	P01976	P02006	P02040	P02078	P02111	P02142	P07036	P08256
A1A4Q7	O24521	P01944	P01977	P02007	P02042	P02080	P02112	P02143	P07402	P08257
A1YZP4	O42425	P01945	P01978	P02008	P02044	P02081	P02113	P02208	P07403	P08258
A2TDC2	O76242	P01946	P01979	P02009	P02046	P02082	P02114	P02209	P07404	P08259
A2TDC3	O76243	P01947	P01980	P02010	P02047	P02083	P02115	P02213	P07405	P08260
A2V8C0	O77655	P01948	P01981	P02011	P02048	P02084	P02116	P04237	P07406	P08261
A2V8C1	O81941	P01949	P01982	P02012	P02049	P02085	P02117	P04238	P07407	P08422
A2V8C2	O88752	P01950	P01983	P02013	P02050	P02086	P02118	P04239	P07408	P08423
A4GTS5	O88754	P01951	P01984	P02014	P02051	P02087	P02120	P04240	P07409	P08535
A4GX73	O93348	P01953	P01985	P02015	P02052	P02088	P02121	P04241	P07410	P08849
A6YH87	O93349	P01954	P01986	P02016	P02053	P02089	P02122	P04242	P07411	P08850
A7UAU9	O93351	P01955	P01987	P02017	P02054	P02090	P02123	P04244	P07412	P08851
A9JSP7	O96457	P01956	P01988	P02018	P02055	P02091	P02124	P04245	P07413	P08852
A9XDF6	P01923	P01957	P01989	P02019	P02057	P02092	P02125	P04246	P07414	P08853
B0BL34	P01924	P01958	P01990	P02020	P02058	P02093	P02126	P04252	P07415	P09105
B0BL35	P01926	P01959	P01991	P02021	P02059	P02094	P02127	P04346	P07417	P09420
B1H216	P01928	P01960	P01992	P02022	P02060	P02095	P02128	P04442	P07419	P09421
B1Q450	P01929	P01961	P01993	P02024	P02061	P02097	P02129	P04443	P07421	P09422
B2BP38	P01930	P01962	P01994	P02025	P02062	P02099	P02130	P04444	P07425	P09423
B2ZUE0	P01932	P01963	P01995	P02026	P02064	P02100	P02131	P06148	P07428	P09839
O02004	P01933	P01964	P01996	P02028	P02065	P02101	P02132	P06467	P07429	P09840
O02480	P01934	P01965	P01997	P02029	P02066	P02102	P02133	P06635	P07430	P09904
O04985	P01935	P01966	P01998	P02030	P02067	P02103	P02134	P06636	P07431	P09905
O04986	P01936	P01967	P01999	P02031	P02070	P02104	P02135	P06637	P07432	P09906
O09232	P01937	P01968	P02000	P02032	P02072	P02105	P02136	P06638	P07433	P09907
O12985	P01938	P01969	P02001	P02033	P02073	P02106	P02137	P06639	P07803	P09908
O13077	P01939	P01971	P02002	P02035	P02074	P02107	P02138	P06642	P08054	P09909
O13078	P01940	P01972	P02003	P02036	P02075	P02108	P02139	P06643	P08223	P0C0U6
O13163	P01941	P01973	P02004	P02038	P02076	P02109	P02140	P07034	P08224	P0C0U7
O13164	P01942	P01975	P02005	P02039	P02077	P02110	P02141	P07035	P08225	P0C0U8

Table 3

SwissProt/UniProt Accession Numbers for Hemoglobin dataset – Part 2.

P0C237	P14524	P20018	P41327	P67822	P68194	P83132	Q27940	Q6AW44	Q8AXX7	Q9TSN9
P0C238	P14525	P20019	P41328	P67823	P68222	P83133	Q28220	Q6AW45	Q8AYL9	Q9TVA3
P0C239	P14526	P20243	P41329	P67824	P68223	P83134	Q28221	Q6B0K9	Q8AYM0	Q9U6L2
P0C240	P14527	P20244	P41330	P68011	P68224	P83135	Q28338	Q6BBJ0	Q8AYM1	Q9U6L6
P10057	P15162	P20245	P41331	P68012	P68225	P83270	Q28356	Q6BBJ2	Q8BPF4	Q9XSE9
P10058	P15163	P20246	P41332	P68013	P68226	P83271	Q28496	Q6BBK1	Q8BYM1	Q9XSK1
P10059	P15164	P20247	P45718	P68014	P68227	P83272	Q28507	Q6H1U7	Q8GV40	Q9XSN2
P10060	P15165	P20854	P45719	P68015	P68228	P83273	Q28775	Q6IBF6	Q8GV41	Q9XSN3
P10061	P15166	P20855	P45720	P68016	P68229	P83478	Q28779	Q6LDH0	Q8GV42	Q9XTL1
P10062	P15448	P21197	P45721	P68017	P68230	P83479	Q28931	Q6LDH1	Q8HY34	Q9XTM2
P10777	P15449	P21198	P45722	P68018	P68231	P83611	Q28932	Q6QDC2	Q8QG65	Q9Y0D5
P10778	P15469	P21199	P51438	P68019	P68232	P83612	Q29415	Q6R7N2	Q90485	Q9Y0E6
P10780	P16309	P21200	P51440	P68020	P68234	P83613	Q2KPA3	Q6S9E2	Q90486	Q9YGW1
P10781	P16417	P21201	P51441	P68021	P68235	P83614	Q2KPA4	Q6T497	Q90487	Q9YGW2
P10782	P16418	P21379	P51442	P68022	P68236	P83623	Q2MHE0	Q6WN20	Q90ZM4
P10783	P17689	P21380	P51443	P68023	P68237	P83624	Q2PAD4	Q6WN21	Q91473
P10784	P18435	P21667	P51465	P68024	P68238	P83625	Q3C1F3	Q6WN22	Q941P9
P10785	P18436	P21668	P55267	P68025	P68239	P84203	Q3C1F4	Q6WN25	Q941Q1
P10786	P18707	P21766	P56250	P68026	P68240	P84204	Q3MQ26	Q6WN26	Q946U7
P10883	P18969	P21767	P56251	P68027	P68256	P84205	Q3S3T0	Q6WN27	Q947C5
P10885	P18970	P21768	P56285	P68028	P68257	P84206	Q3TUN7	Q6WN28	Q94FG6
P10892	P18971	P21871	P56691	P68029	P68258	P84216	Q3U0A6	Q6WN29	Q94FT7
P10893	P18972	P22740	P56692	P68030	P68871	P84217	Q3Y9L5	Q6Y239	Q94FT8
P11025	P18973	P22741	P60523	P68031	P68872	P84479	Q3Y9L6	Q6Y257	Q95190
P11251	P18974	P22742	P60524	P68044	P68873	P84604	Q42665	Q760P9	Q95238
P11342	P18975	P22743	P60525	P68045	P68944	P84609	Q42831	Q760Q0	Q95NK8
P11517	P18976	P23016	P60526	P68046	P68945	P84610	Q43306	Q760Q1	Q966U3
P11748	P18977	P23017	P60529	P68047	P69891	P84611	Q45V69	Q760Q2	Q96FH6
P11749	P18978	P23018	P60530	P68048	P69892	P84652	Q45V70	Q78PA4	Q98905
P11750	P18981	P23019	P61772	P68049	P69905	P84653	Q45XH3	Q7JFN6	Q98TS0
P11751	P18982	P23020	P61773	P68050	P69906	P84790	Q45XH5	Q7JFR7	Q9AWA9
P11752	P18983	P23600	P61774	P68051	P69907	P84791	Q45XH6	Q7LZB9	Q9CWS5
P11753	P18984	P23601	P61775	P68052	P80043	P84792	Q45XH7	Q7LZC1	Q9CY06
P11754	P18985	P23602	P61920	P68053	P80044	P85081	Q45XH8	Q7LZC2	Q9CY10
P11755	P18986	P23740	P61921	P68054	P80216	P85082	Q45XI4	Q7LZC3	Q9CZK5
P11756	P18987	P23741	P61947	P68055	P80270	Q03902	Q45XI5	Q7LZL6	Q9D0B2
P11757	P18988	P24291	P61948	P68056	P80271	Q0PB48	Q45XI6	Q7LZM6	Q9DF25
P11758	P18989	P24292	P62363	P68057	P80726	Q0PG38	Q45XI7	Q7LZM7	Q9FUD6
P11896	P18990	P24589	P62387	P68058	P80727	Q0WSU5	Q45XI8	Q7LZM8	Q9FVL0
P13273	P18993	P24659	P62741	P68059	P80945	Q0ZA50	Q45XI9	Q7M2Y4	Q9FY42
P13274	P18994	P24660	P62742	P68060	P80946	Q10732	Q45XJ0	Q7M2Y5	Q9GJS7
P13557	P18995	P26915	P63105	P68061	P81023	Q10733	Q4F6Z2	Q7M3B6	Q9GLX4
P13558	P18996	P26916	P63106	P68062	P81024	Q17153	Q4VIX3	Q7M3B8	Q9I9I3
P14259	P19002	P28780	P63107	P68063	P81042	Q17155	Q53I65	Q7M3C2	Q9M3U9
P14260	P19014	P28781	P63108	P68064	P81043	Q17156	Q549D9	Q7M413	Q9M593
P14261	P19015	P29623	P63109	P68065	P82111	Q17157	Q549G1	Q7M418	Q9M630
P14387	P19016	P29624	P63110	P68068	P82112	Q17286	Q58L97	Q7M419	Q9PRL9
P14388	P19645	P29625	P63111	P68069	P82113	Q1AGS4	Q5BLF6	Q7M421	Q9PVM1
P14389	P19646	P29626	P63112	P68070	P82315	Q1AGS5	Q5GLZ6	Q7M422	Q9PVM2
P14390	P19759	P29628	P67815	P68071	P82316	Q1AGS6	Q5GLZ7	Q7T1B0	Q9PVM3
P14391	P19760	P30892	P67816	P68077	P82345	Q1AGS7	Q5I122	Q7Y079	Q9PVM4
P14392	P19789	P30893	P67817	P68078	P82990	Q1AGS8	Q5KSB7	Q7Y1Y1	Q9PVU6
P14520	P19831	P33499	P67818	P68079	P83114	Q1AGS9	Q5MD69	Q7ZT21	Q9TS34
P14521	P19832	P41260	P67819	P68087	P83123	Q1W6G9	Q5RM02	Q803Z5	Q9TS35
P14522	P19885	P41261	P67820	P68168	P83124	Q26505	Q5XLE5	Q862A7	Q9TSN7
P14523	P19886	P41262	P67821	P68169	P83131	Q27126	Q67XG0	Q86G74	Q9TSN8

Table 4

SwissProt/UniProt Accession Numbers for Myoglobin dataset.

O77003	P02159	P02174	P02190	P02205	P14393	P51535	P68086	Q01966	Q701N9	Q9DEP1
P02144	P02160	P02177	P02191	P02206	P14396	P51537	P68189	Q03459	Q76G09	Q9DGI8
P02145	P02161	P02178	P02192	P02210	P14397	P56208	P68190	Q0KIY0	Q7LZM1	Q9DGI9
P02147	P02163	P02179	P02193	P02211	P14398	P62734	P68276	Q0KIY1	Q7LZM2	Q9DGJ0
P02148	P02164	P02180	P02194	P02214	P14399	P62735	P68277	Q0KIY2	Q7LZM3	Q9DGJ1
P02150	P02165	P02181	P02196	P02215	P15160	P63113	P68278	Q0KIY3	Q7LZM4	Q9DGJ2
P02151	P02166	P02182	P02197	P04247	P17724	P63114	P68279	Q0KIY5	Q7LZM5	Q9QZ76
P02152	P02167	P02183	P02199	P04248	P20856	P68080	P80721	Q0KIY7	Q7M416
P02153	P02168	P02184	P02200	P04249	P29287	P68081	P80722	Q0KIY9	Q7M424
P02154	P02169	P02185	P02201	P04250	P30562	P68082	P83682	Q2MJN4	Q7T044
P02155	P02170	P02186	P02202	P09965	P31331	P68083	P84997	Q6I7B0	Q9DEN8
P02156	P02171	P02187	P02203	P0C227	P32428	P68084	P85077	Q6PL31	Q9DEN9
P02157	P02173	P02189	P02204	P11343	P49672	P68085	P87497	Q6VN46	Q9DEP0

Visualization of the Feature Vectors

The feature vector is high dimensional. PCA is a popular method of reducing the number of variables in a vector. This method provides a linear transformation of the variables of the feature vector into a new set of uncorrelated variables. In so doing, it captures the dominant variations in the data set. The first two principal components contain more information than any other pair of linearly constructed variables and thus are used in our analysis [6]. This allows us to easily visualize the key elements of the data without loss of much information. An alternative method for visualizing the patterns emerging from the feature vector method of analysis is to create a dendrogram using agglomerative hierarchical clustering. We use agglomerative hierarchical clustering with complete linkage [7] to provide a more detailed view of the data, and of the relationships between groups within the data. In order to demonstrate the utility and flexibility of the feature vector method, we provide one dendrogram as part of the PKC analysis.

Identification of Protein Sub-Types

Kinases are proteins which modify other proteins by phosphorylation, the covalent addition of phosphate groups [8]. The PKC family is a large multigene family of serine/threonine kinases [8]. Six main groups of PKCs can be identified by domain architecture: conventional, novel, atypical, PKCμ-like, fungal PKC1, and PKC-related kinases [9]. The first three of these groups can be further categorized into subtypes [9]. In general, the PKC domain architecture (fig. 2A) consists of a regulatory region and a catalytic domain [10]. The regulatory region contains several functional domains of varying types [10]. True PKCs are classified as conventional, novel or atypical based on the functional domains present in the regulatory regions of the PKCs (fig. 2A). Briefly, conventional PKCs contain subtypes α, βI, βII and γ; novel PKCs contain subtypes θ, ε, δ and η; and atypical PKCs contain λ\ι and ζ [9]. The catalytic domain is more conserved and more commonly used for differentiating between families of protein kinases [11]. However, it is also useful in characterizing the PKCμ-like kinases, which contain markedly different catalytic regions from the rest of the protein kinase C members [12].

Figure 2

Analysis of the Protein Kinase C (PKC) family using the feature vector method.

Analysis of the Protein Kinase C (PKC) family using the feature vector method.

(A) Structural architecture types of PKCs used in analysis, showing regulatory domains – C1, C2, PB1, HR1 and PH – and catalytic domains. cPKC – conventional PKCs, aPKC – atypical PKCs, nPKC – novel PKCs, PKC1 – fungal PKC1s, PRK – PKC-related kinases, PKCμ – PKCμ-like PKCs; (B) agglomerative hierarchical clustering dendrogram of PKC feature vectors for structural architecture types; (C) principle component analysis (PCA) of PKC feature vectors, coded by known architectural type of proteins – red dots: cPKCs; blue dots: aPKCs; green dots: nPKCs; black dots: PKC1s; orange dots: PKCμ; yellow dots: PRKs; dashed lines indicate dividing surfaces for identifying major clusters in the data set. The dendrogram created by agglomerative hierarchical clustering of the PKC feature vectors (fig. 2B) successfully recreates the phylogenetic relationships between PKC architectural types and highlights the degree of difference between PKC-related kinases, PKC1s and other PKCs. By running a PCA on the feature vector values for each PKC protein, we were able to quickly visualize the six architecture types of PKCs in our dataset. Figure 2C shows the PCA output for PKCs. A dashed line marking a clear dividing surface is added to this figure to demonstrate divisions in the data that warrant further analysis. Fungal PKC1s are clearly separated from other PKCs and can be identified as an important, distinct grouping (fig. 2C). In addition, PKC-related kinases and true PKCs are located in distinct clusters (fig. 2C). Finally, conventional PKCs and novel PKCs are resolved into distinct clusters (fig. 2C). A similar identification of hemoglobin structure was also possible (fig. 3). Hemoglobins are large proteins which function in oxygen transport [13]. Each hemoglobin molecule contains 2 alpha chains and 2 beta chains, subunits which are identifiable by structural characteristics [13]. Alpha hemoglobins lack a specific alpha-helix, the D helix, that is present in beta hemoglobins [14]. Many organisms have several distinct hemoglobins, an adult form and embryonic or fetal forms created by combining different alpha and beta hemoglobin units [13]. There are more types of embryonic and fetal alpha hemoglobins than beta hemoglobins, and thus alpha hemoglobins are presumed evolutionarily older [13]. In protein databases, alpha and beta chain hemoglobins belonging to a given species are typically recorded in separate entries and so are separate in our data set also.

Figure 3

Analysis of hemoglobin proteins using the feature vector method.

Analysis of hemoglobin proteins using the feature vector method.

PCA results of hemoglobin feature vectors, coded by known protein type – red dots: beta-chain hemoglobins; blue dots: alpha-chain hemoglobins, including fetal alpha-type proteins; green dots: leghaemoglobins; grey dots: other hemoglobins. Using the feature vector method, the PCA identifies the difference between alpha chain hemoglobins, beta chain hemoglobins and leghaemoglobins, a type of monomeric hemoglobin chain found in plants [15], as the main features of the protein set (fig. 3). A range of other types of hemoglobins occur in the dataset in small numbers, but are not well resolved into distinct clusters due to their rarity in the data (see below). These proteins include the non-symbiotic plant hemoglobins, also called truncated or 2-on-2 hemoglobins [16]; the lamprey/hagfish hemoglobins [17]; bacterial hemoglobins; and erythrocruorins, which are large extracellular hemoglobins found in annelid worms and arthropods [18]. The feature vector method is able to store large amounts of information about the proteins in the dataset. When the 150 myoglobins are added to the hemoglobin dataset, the feature vector is able to distinguish these two types of proteins (fig. 4A), while retaining information about structural relationships within the hemoglobin family (fig. 4B). Myoglobins are single chain hemoproteins and share a common ancestor with hemoglobins, more than 500 million years ago [19]. Structurally, myoglobins are similar to leghaemoglobins, but functionally these proteins are quite different, with leghaemoglobins having significantly higher oxygen affinity and a broad range of functions within plant nodules [15], [19], [20]. The feature vector method, combined with PCA for visualization, clearly separates the majority of myoglobins from hemoglobins (fig. 4A), while preserving the ability to identify structural relationships between alpha and beta hemoglobins and leghaemoglobins (fig. 4B).

Figure 4

Identification of protein type as myoglobin and hemoglobin by feature vector analysis.

Identification of protein type as myoglobin and hemoglobin by feature vector analysis.

(A) PCA results of hemoglobin and myoglobin feature vectors, coded by known protein family – red dots: all hemoglobins; black triangles: myoglobins; (B) PCA results of hemoglobin and myoglobin feature vectors, coded by known protein type with hemoglobins identified by subtype – red dots: beta-chain hemoglobins; blue dots: alpha-chain hemoglobins, including fetal alpha-type proteins; green dots: leghaemoglobins; grey dots: other hemoglobins; black triangles: myoglobins. When analyzing a large and varied protein dataset, some protein types may occur infrequently. These rare protein types are more difficult to cluster using PCA due to the limited amount of information available to the algorithm. As a result, these rare proteins cluster near the center of the PCA output, creating ‘noise’ in the analysis (fig. 3). By limiting the hemoglobin dataset to only adult alpha and beta chain mammalian hemoglobins and mammalian myoglobins, the most frequent protein types present in the dataset, the ability of the feature vector method to create clear separation between different groups of proteins is readily apparent (fig. 5). As in previous figures, dashed lines indicating decision surfaces are used to highlight clusters warranting further analysis.

Figure 5

Use of the feature vector method allows unequivocal protein identification when data is limited to large, well-defined protein types.

PCA results of adult alpha and adult beta mammalian hemoglobins and mammalian myoglobin feature vectors, coded by known protein type to demonstrate the ability of the feature vector to produce perfect separation of types – dashed lines indicate dividing surfaces for identifying clusters in the data; red dots: beta-chain mammalian hemoglobins; blue dots: alpha-chain mammalian hemoglobins, excluding fetal proteins; black triangles: mammalian myoglobins.

Use of the feature vector method allows unequivocal protein identification when data is limited to large, well-defined protein types.

Discussion

The feature vector method described here is intended to measure the distance between protein sequences in a way that makes numerical comparisons easy and allows identification of similarity within large numbers of proteins that are not too distantly related. Using PKCs and hemoproteins as examples, we demonstrated the effectiveness of this method. When groups are completely distinct, perfect separation can be achieved; where there are gradual changes in the sequences of proteins, the feature vector performs well in conjunction with principal components analysis. Importantly, this method does not attempt to characterize differences as functional or non-functional, nor does it seek to identify key single point mutations. Rather, the goal is to provide a rapid understanding of the patterns of relatedness in large datasets of protein sequences. Although protein kinase C was one of the first protein kinases discovered [21], categorizing members of this family is particularly challenging [22], [23]. Our method successfully reproduces the traditional classification of PKCs and clusters family members on the basis of these classifications [24]. Previous work analyzing relationships among multiple PKCs or among the larger kinase superfamily has been limited by the maximum dataset size [23], [24], [25], in a way that our method is not. The statistical feature vector method is particularly useful as a simple way of identifying subgroups in non-mammalian PKCs, an area where little is known. In the future, more detailed visualization techniques may suggest new relationships which could be explored experimentally. Mammalian hemoglobins are also well understood, in terms of classification [16]. However, research is increasingly identifying hemoglobin-like proteins outside of mammals, including bacterial hemoglobins and non-symbiotic hemoglobins in plants [16]. As the number of hemoglobins identified in these organisms increases, the feature vector method will provide a simple tool for identifying structural groupings within these proteins. The feature vector method provides one of the most definitive ways of classifying various types of proteins. This method provides an advantage over other classification programs in ease of use and, unlike other methods, the feature vector is not constrained to a single protein family or superfamily [23]. We have shown the usefulness of this method in PKCs and hemoproteins, and we anticipate that it will perform equally well when applied to other protein families providing a simple, rapid tool for sorting through the increasingly large datasets of proteins now available to researchers. In the future, the utility of this method can be increased by applying new, and more specific, visualization tools to the analysis of the feature vector output, such as K-means, agglomerative hierarchical clustering, artificial neural networks and self-organizing maps. For a given data set, the patterns of variation in sequences can be learned by neural networks, or other methods, to provide a more accurate classification or clustering than can be achieved with less flexible methods like principal components analysis.

Methods

Datasets

We used three online protein sequence databases to create our protein datasets: Uniprot KB, UniprotKB/Swissprot, and NCBI Entrez-Protein. UniprotKB (www.uniprot.org) is an online repository of protein sequences; UniprotKB/Swissprot (http://ca.expasy.org/sprot/) builds upon this repository through annotation of protein sequences. Information available in UniprotKB/Swissprot includes citations for related publications, species name, protein family, domain structure and detail on protein variants and structure. NCBI Entrez-Protein (http://www.ncbi.nlm.nih.gov/protein/) is an online protein sequence database curated by the National Center for Biotechnology Information (NCBI). The protein kinase C dataset of 127 protein sequences was downloaded from the NCBI Entrez-Protein and UniProtKB/SwissProt databases. The hemoglobin and myoglobin datasets, of 904 and 150 protein sequences respectively, were downloaded from the UniProtKB database. In order to ensure that sequences were not fragments or labeled incorrectly by protein family, sequences were analyzed using the SMART domain recognition software on the UniProtKB website. In addition, for all sequences the family classification was confirmed and the subfamily classification was assigned based on peer-reviewed journal articles which were obtained through the SwissProt database reference listings and based on notations on the UniProtKB entries where detailed information from articles was not available. Sample Parameter Calculations. This file works through the calculations of the parameters for Alanine in a short, hypothetical protein, and demonstrates the construction of the feature vector for this protein. (0.05 MB PDF) Click here for additional data file. Feature vectors computation in C++. Computation of the feature vectors for a protein data set in C++. (0.01 MB TXT) Click here for additional data file. Feature vector computation in Mathematica. Computation of the feature vector for a single protein in Mathematica. (0.00 MB TXT) Click here for additional data file. PCA code for Matlab. Principle component analysis code for Matlab. (0.00 MB TXT) Click here for additional data file. Protein Kinase C Feature Vectors. This file contains the set of all feature vectors for the PKC proteins in our dataset. (0.16 MB XLS) Click here for additional data file. Hemoglobin Feature Vectors. This file provides the set of all feature vectors for the Hemoglobins in our dataset. (1.03 MB XLS) Click here for additional data file. Myoglobin Feature Vectors. This file provides the set of all feature vectors for the Myoglobins in our dataset. (0.17 MB XLS) Click here for additional data file. Protein datasets. Accession numbers and taxonomic information for Protein Kinase C (PKC), Hemoglobin and Myoglobin dataset. Each protein dataset is provided as a separate worksheet. (0.16 MB XLS) Click here for additional data file.

22 in total

Review 1. Plants, humans and hemoglobins.

Authors: Suman Kundu; James T Trent; Mark S Hargrove
Journal: Trends Plant Sci Date: 2003-08 Impact factor: 18.313

Review 2. The protein kinase family: conserved features and deduced phylogeny of the catalytic domains.

Authors: S K Hanks; A M Quinn; T Hunter
Journal: Science Date: 1988-07-01 Impact factor: 47.728

3. Multiple, distinct forms of bovine and human protein kinase C suggest diversity in cellular signaling pathways.

Authors: L Coussens; P J Parker; L Rhee; T L Yang-Feng; E Chen; M D Waterfield; U Francke; A Ullrich
Journal: Science Date: 1986-08-22 Impact factor: 47.728

4. Lamprey hemoglobin. Structural basis of the bohr effect.

Authors: Y Qiu; D H Maillett; J Knapp; J S Olson; A F Riggs
Journal: J Biol Chem Date: 2000-05-05 Impact factor: 5.157

Review 5. Myoglobin: an essential hemoprotein in striated muscle.

Authors: George A Ordway; Daniel J Garry
Journal: J Exp Biol Date: 2004-09 Impact factor: 3.312

6. Complete amino acid sequences of the major early embryonic alpha-like globins of the chicken.

Authors: B S Chapman; A J Tobin; L E Hood
Journal: J Biol Chem Date: 1980-10-10 Impact factor: 5.157

Review 7. Protein kinase D: a family affair.

Authors: An Rykx; Line De Kimpe; Svetlana Mikhalap; Tibor Vantus; Thomas Seufferlein; Jackie R Vandenheede; Johan Van Lint
Journal: FEBS Lett Date: 2003-07-03 Impact factor: 4.124

Review 8. Protein kinase C isozymes as potential targets for anticancer therapy.

Authors: Johann Hofmann
Journal: Curr Cancer Drug Targets Date: 2004-03 Impact factor: 3.428

9. The relationship of protein conservation and sequence length.

Authors: David J Lipman; Alexander Souvorov; Eugene V Koonin; Anna R Panchenko; Tatiana A Tatusova
Journal: BMC Evol Biol Date: 2002-11-01 Impact factor: 3.260

10. Kinomer v. 1.0: a database of systematically classified eukaryotic protein kinases.

Authors: David M A Martin; Diego Miranda-Saavedra; Geoffrey J Barton
Journal: Nucleic Acids Res Date: 2008-10-30 Impact factor: 16.971

5 in total

1. A novel method of characterizing genetic sequences: genome space with biological distance and applications.

Authors: Mo Deng; Chenglong Yu; Qian Liang; Rong L He; Stephen S-T Yau
Journal: PLoS One Date: 2011-03-02 Impact factor: 3.240

2. Computational Prediction and Analysis of Associations between Small Molecules and Binding-Associated S-Nitrosylation Sites.

Authors: Guohua Huang; Jincheng Li; Chenglin Zhao
Journal: Molecules Date: 2018-04-19 Impact factor: 4.411

3. Identification of Enzymes-specific Protein Domain Based on DDE, and Convolutional Neural Network.

Authors: Rahu Sikander; Yuping Wang; Ali Ghulam; Xianjuan Wu
Journal: Front Genet Date: 2021-11-30 Impact factor: 4.599

4. DFA7, a new method to distinguish between intron-containing and intronless genes.

Authors: Chenglong Yu; Mo Deng; Lu Zheng; Rong Lucy He; Jie Yang; Stephen S-T Yau
Journal: PLoS One Date: 2014-07-18 Impact factor: 3.240

5. Broad cross-reactive IgG responses elicited by adjuvanted vaccination with recombinant influenza hemagglutinin (rHA) in ferrets and mice.

Authors: Jiong Wang; Shannon P Hilchey; Marta DeDiego; Sheldon Perry; Ollivier Hyrien; Aitor Nogales; Jessica Garigen; Fatima Amanat; Nelson Huertas; Florian Krammer; Luis Martinez-Sobrido; David J Topham; John J Treanor; Mark Y Sangster; Martin S Zand
Journal: PLoS One Date: 2018-04-11 Impact factor: 3.240

5 in total