Structural phylogeny by profile extraction and multiple superimposition using electrostatic congruence as a discriminator.

Sandeep Chakraborty¹, Basuthkar J Rao¹, Nathan Baker², Bjarni Asgeirsson³.

Abstract

Phylogenetic analysis of proteins using multiple sequence alignment (MSA) assumes an underlying evolutionary relationship in these proteins which occasionally remains undetected due to considerable sequence divergence. Structural alignment programs have been developed to unravel such fuzzy relationships. However, none of these structure-based methods have used electrostatic properties to discriminate between spatially equivalent residues. We present a methodology for MSA of a set of related proteins with known structures using electrostatic properties as an additional discriminator (STEEP). STEEP first extracts a profile, then generates a multiple structural superimposition providing a consolidated spatial framework for comparing residues and finally emits the MSA. Residues that are aligned differently by including or excluding electrostatic properties can be targeted by directed evolution experiments to transform the enzymatic properties of one protein into another. We have compared STEEP results to those obtained from an MSA program (ClustalW) and a structural alignment method (MUSTANG) for chymotrypsin serine proteases. Subsequently, we used PhyML to generate phylogenetic trees for the serine and metallo-β-lactamase superfamilies from the STEEP generated MSA, and corroborated the accepted relationships in these superfamilies. We have observed that STEEP acts as a functional classifier when electrostatic congruence is used as a discriminator, and thus identifies potential targets for directed evolution experiments. In summary, STEEP is unique among phylogenetic methods for its ability to use electrostatic congruence to specify mutations that might be the source of the functional divergence in a protein family. Based on our results, we also hypothesize that the active site and its close vicinity contain enough information to infer the correct phylogeny for related proteins.

Year: 2013 PMID: 25364645 PMCID: PMC4212511 DOI: 10.4161/idp.25463

Source DB: PubMed Journal: Intrinsically Disord Proteins ISSN: 2169-0707

Introduction

DNA sequencing technologies have provided a quantitative foundation for our understanding of evolution, which was previously based on logical, yet empirical, observations. The chronology of the development of computational techniques has closely followed innovations in biotechnology. Pairwise alignment algorithms for nucleotide sequences, both global and local, were enhanced to incorporate multiple sequences from related proteins. Such multiple sequence alignment (MSA) methods enabled visualization of evolutionary pathways through phylogenetic trees. While considerable divergence in sequence often resembles noise and masks true relationships, structural conservation in such cases has provided the basis for evolutionary kinship. For instance, MSA techniques are not applicable to the serine and metallo-β-lactamase superfamilies due to significant sequence divergence. Lately, rapid strides in crystallization techniques have fueled progress in structural alignment methods, both for pairs of proteins and for multiple proteins. The program MAPS (an extension of the program TOP), which has been used for the structural analysis of metallo-β-lactamases, first superimposes the proteins and then computes the phylogeny based on structural similarity of the main and side-chain atoms. A widely used methodology for structural alignment (MUSTANG) uses a simple dynamic programming algorithm for all pairs of structures and applies a robust scoring scheme obviating the need for troublesome gap penalties. A recent method uses many informative features (torsion angles, secondary structure, residue type, surface accessibility, etc.) to guide the alignment. An innovative technique for alignment allows local flexibility between fragments which might be physically impossible under rigid body transformations and restores geometric consistency at the end.
Another multiple protein alignment method (MISTRAL) uses the minimization of an empirical energy function of the relative rotations and translations of the molecules. However, such methods have not addressed the problem of identifying residues which, although spatially equivalent, have diverged from a stereochemical and electrostatic perspective, resulting in functional plasticity. In the current work, we present a methodology for generating the MSA of a set of related proteins with known structures, using electrostatic properties as an additional discriminator: structure and electrostatic potential based multiple sequence alignment (STEEP). We demonstrate that residues identified by comparing the alignments obtained by including and excluding electrostatic properties can be targeted by directed evolution experiments to transform the enzymatic properties of one protein into another. We also show that the active site vicinity contains enough information to infer correct kinship in a set of related proteins. Previous work by our group has established the spatial and electrostatic congruence in cognate residue pairs of the active site in proteins with the same functionality, a method named CataLytic Active Site Prediction (CLASP). CLASP was used to unravel a serine protease scaffold in alkaline phosphatases, and a scaffold recognizing a β-lactam (imipenem) in a cold-active Vibrio alkaline phosphatase. STEEP superimposes the proteins based on the active site motif specified in one of the proteins by extracting matching scaffolds using CLASP, thus pruning out unrelated proteins, which are known to affect the quality of MSA results. It then considers the reactive atoms of the residues in the superimposed cluster while matching the distance, and as an additional option uses electrostatic criteria to prune out non-congruent residues, and emits the MSA for the set of proteins. Such a constrained alignment highlights the conserved residues from an electrostatic perspective as well.
Comparison of these alignments could form the basis of mutations in directed evolution experiments that intend to endow the desired protein with certain enzymatic properties. We have compared results obtained with STEEP to those obtained from a sequence based MSA program (ClustalW), and a structural alignment method (MUSTANG) for a set of chymotrypsin serine proteases. We have also generated phylogenetic trees for the serine and metallo-β-lactamase superfamilies from the STEEP generated MSA using PhyML, and corroborated the accepted relationships of proteins in these 2 superfamilies. Interestingly, using electrostatic congruence as a discriminator led to a functional classification instead of a true evolutionary relationship. We observe that Trp154 in class D serine β-lactamases (SBL) and signal transduction proteins is spatially equivalent to Glu166 in class A SBL but lacks electrostatic congruence. Although this critical Trp154 has been mutated to Gly, Ala, and Phe with resulting poor catalytic efficiencies and reduced stability, we propose that a mutation to Glu might show functional similarity to class A SBLs by mimicking the Glu166. In summary, STEEP is a multi-faceted methodology that generates evolutionary and functional relationships in a set of related proteins, produces a multiple superimposition, and proposes mutations based on electrostatic properties that might endow one protein with the enzymatic functionality of another. Thus, it helps in narrowing down critical mutations that would be expected to shape the functional plasticity of any given enzyme superfamily, especially in cases where sequence divergence has left few traces of any relationship.
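The matching step described above (spatial proximity, with electrostatic pruning as an option) can be illustrated with a minimal Python sketch. It is not the STEEP implementation: the residue representation, the function name `match_residues`, the cutoffs `max_dist` and `max_pot_gap`, and the idea of comparing one potential value per residue are all assumptions made for illustration, and the coordinates and potentials are toy values rather than measurements.

```python
import math

def match_residues(template, target, max_dist=2.0, max_pot_gap=None):
    """Map each template residue to its nearest target residue, or None (a gap).

    Each residue is a tuple (label, xyz, potential): xyz is the reactive-atom
    position after superimposition, potential is in kT/e.
    max_dist and max_pot_gap are hypothetical cutoffs, not values from the paper.
    """
    mapping = {}
    for t_label, t_xyz, t_pot in template:
        best_label, best_dist = None, max_dist
        for q_label, q_xyz, q_pot in target:
            d = math.dist(t_xyz, q_xyz)
            if d > best_dist:
                continue  # too far, or farther than the current best match
            if max_pot_gap is not None and abs(t_pot - q_pot) > max_pot_gap:
                continue  # spatially equivalent but not electrostatically congruent
            best_label, best_dist = q_label, d
        mapping[t_label] = best_label
    return mapping

# Toy data: T1/Q1 are close in space but differ strongly in potential.
template = [("T1", (0.0, 0.0, 0.0), -150.0), ("T2", (5.0, 0.0, 0.0), -100.0)]
target = [("Q1", (0.5, 0.0, 0.0), -20.0), ("Q2", (5.2, 0.0, 0.0), -110.0)]

spatial_only = match_residues(template, target)
with_electrostatics = match_residues(template, target, max_pot_gap=50.0)
```

Residues that are matched in `spatial_only` but become gaps in `with_electrostatics` are the kind of positions the text proposes as candidates for mutation. The sketch does not enforce a one-to-one mapping, which a real alignment would need.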

Results

Chymotrypsin serine proteases

Serine proteases are grouped based on structural homology and are then further sub-grouped into families with similar sequences. The 2 major families, chymotrypsin and subtilisin, are a classical example of convergent evolution where the catalytic Ser-His-Asp triad shows very similar geometry in the structurally different chymotrypsin and subtilisin families. We chose a set of eight proteins (PDBids: 2ALP, 1SGT, 1TGS, 2SGA, 1PPF, 3EST, 3RP2 and 1TPP) for analysis based on previous work on serine proteases, barring one (PDBid:5CHA) which did not complete APBS electrostatic analysis (Tables S1 and S2). The motif (His57, Asp102, Ser195) from a trypsin protein (PDBid:1SBT) was chosen to represent serine proteases. CLASP analysis using this query motif detected significantly congruent scaffolds in each of the proteins (Table 1).

Table 1. Potential and spatial congruence of the active site residues in proteins from the chymotrypsin superfamily

PDB		ab	ac	bc
2ALP	D	4.7	3.1	6.2
	PD	-13.6	-86.5	-72.9
1SGT	D	5.5	3	8
	PD	4.2	-120.4	-124.5
1TGS	D	5.2	2.6	7.3
	PD	31.6	-85.8	-117.4
2SGA	D	4.6	3	6.2
	PD	59.1	-123.6	-182.6
1PPF	D	5.4	2.5	7.3
	PD	-29	-103.7	-74.6
3EST	D	4.6	3.2	6.4
	PD	-3.7	-124	-120.3
3RP2	D	5	3.1	6.5
	PD	-51.9	-136.4	-84.5
1TPP	D	5.5	2.7	7.6
	PD	-83.9	-162.5	-78.6

The active site atoms are HIS57NE2 (a), ASP102OD1 (b) and SER195OG (c). D = Pairwise distance in Å. PD = Pairwise potential difference. The electrostatic potential is in dimensionless units of kT/e where k is Boltzmann's constant, T is the temperature in K and e is the charge of an electron.

The structural profile was used to generate a multiple superimposition of the proteins (see Materials and Methods), and provided a single frame of reference for comparing the proteins with STEEP (Fig. 1A). Since all the structures could be superimposed, we proceeded to find the residues from other proteins in the set which were spatially close to residues of the template protein. Figure 2A and Figure 2C show the alignment and the cladogram using only spatial constraints. The alignment (Fig. 2B) and the cladogram (Fig. 2D) taking electrostatic congruence into consideration resulted in a different phylogeny from the one generated using just spatial constraints. Later, we show for serine β-lactamases (SBL) that this relationship suggests functional relationships rather than true evolutionary kinship. For example, classes A and C SBLs appear as sister taxa when using electrostatic congruence as a discriminator, and it is known that both classes A and C SBLs have the ability to hydrolyze cephalosporins. This is a reasonable finding considering that electrostatic fields have a direct bearing on specificity and functionality.
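The D and PD rows of Table 1 can be reproduced from atom coordinates and per-atom potentials with a few lines of Python. The sketch below is illustrative only: the function name and input layout are assumptions, and the sign convention PD(xy) = φ(x) − φ(y) is inferred from the table itself, where the bc entry equals ac minus ab to within rounding (for 2ALP, −86.5 − (−13.6) = −72.9).

```python
import math
from itertools import combinations

def pairwise_congruence(atoms):
    """Return {pair_label: (distance, potential_difference)} for all atom pairs.

    atoms maps a one-letter label to (xyz, potential), with xyz in Å and the
    potential in kT/e. PD for pair 'xy' is taken as potential(x) - potential(y),
    a convention inferred from Table 1, not stated in the text.
    """
    table = {}
    for x, y in combinations(sorted(atoms), 2):
        (pos_x, phi_x), (pos_y, phi_y) = atoms[x], atoms[y]
        table[x + y] = (math.dist(pos_x, pos_y), phi_x - phi_y)
    return table

# Potentials chosen so the differences match the 2ALP PD row; the coordinates
# are arbitrary placeholders, so the distances are NOT those of 2ALP.
example = pairwise_congruence({
    "a": ((0.0, 0.0, 0.0), 0.0),
    "b": ((3.0, 4.0, 0.0), 13.6),
    "c": ((0.0, 0.0, 2.0), 86.5),
})
```

Only potential differences enter the comparison, so the absolute zero of the potential is irrelevant in this sketch.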

Figure 2. Multiple sequence alignments using STEEP, ClustalW or MUSTANG, and phylogenetic trees generated using PhyML for the chymotrypsin superfamily. The active site motif is marked as `*'. The residues used to initiate STEEP were within a radius of 9 Å from the specified active site residues. (A) Alignment using spatial proximity using STEEP. (B) Alignment using spatial proximity and electrostatic congruence using STEEP. (C) Cladogram generated from (A). (D) Cladogram generated from (B). (E) Alignment using ClustalW. (F) Alignment using MUSTANG. (G) Cladogram generated from (E) (ClustalW). (H) Cladogram generated from (F) (MUSTANG).

Figure 1. Superimposing multiple proteins based on the homologous active site scaffolds for trypsin serine proteases. (A) STEEP generated superimposition, where each amino acid is represented by a user defined reactive atom. (B) MUSTANG generated superimposition. It can be seen that MUSTANG generates a better overall superimposition, but the active site residues are less dispersed after the superimposition by STEEP. (C) STEEP generated superimposition, where each amino acid is represented by the Cα atom.

We next used ClustalW to generate the alignment (Fig. 2E) and the phylogenetic tree (Fig. 2G) for the same set of proteins. We used the sequences obtained from the PDB files, and not the complete fasta sequence, to ensure a fair comparison with STEEP and MUSTANG. MUSTANG results also showed similar alignments (Fig. 2F) and phylogenetic trees (Fig. 2H). For example, all three methods suggest that the protein groups (2ALP-2SGA), (3EST-3RP2), and (1SGT-1TGS-1TPP-1PPF) are closely related. Qualitatively, the structural alignment obtained from MUSTANG (Fig. 1B) was better than that from STEEP (Fig. 1A), but the active site residues in STEEP were less dispersed since STEEP aligns the proteins based on the active site residues. Table 2 shows the RMSD values for the pairwise comparison of a protein (PDBid:1TGS) with all other proteins. Cα atoms that are within 2 Å of each other are considered to be equivalent. However, STEEP comparisons resulted in a better overall fit when each amino acid was represented by the Cα atom rather than the reactive atom, giving a much smaller RMSD and more equivalent residues (Fig. 1C; Table 2). Since the STEEP methodology is directed at identifying active site residue equivalence, it does not intend to obtain the best global superimposition. Thus, by default the amino acids are represented by their reactive atoms (when this applies).
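The two metrics reported in Table 2, the number of equivalent residues and the RMSD over them, can be sketched as follows for two already superimposed structures. This is a simplified illustration under stated assumptions: each residue is reduced to one point, every reference residue is paired with its nearest neighbour in the other structure, and the function names are hypothetical. How STEEP or MUSTANG actually pair residues is not specified in this section.

```python
import math

def equivalent_pairs(ref, other, cutoff=2.0):
    """Pair each reference point with its nearest point in `other`.

    ref and other are non-empty lists of (x, y, z) positions in Å, already
    superimposed. A pair is kept only if the two points lie within `cutoff`
    (2 Å in the text). Returns a list of (ref_index, other_index, distance).
    """
    pairs = []
    for i, p in enumerate(ref):
        j, d = min(((j, math.dist(p, q)) for j, q in enumerate(other)),
                   key=lambda item: item[1])
        if d <= cutoff:
            pairs.append((i, j, d))
    return pairs

def rmsd(pairs):
    """Root-mean-square distance over the matched pairs (nan if none matched)."""
    if not pairs:
        return float("nan")
    return math.sqrt(sum(d * d for _, _, d in pairs) / len(pairs))

# Toy coordinates: two residues fall within 2 Å, the third does not.
ref_points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
other_points = [(1.0, 0.0, 0.0), (10.0, 0.0, 1.0), (30.0, 0.0, 0.0)]
matched = equivalent_pairs(ref_points, other_points)
```

Reporting `len(matched)` next to `rmsd(matched)` reflects the point made in the Table 2 note: a poor superimposition can reach a similar RMSD while matching far fewer residues.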

Table 2. Comparing results obtained with STEEP or MUSTANG for serine proteases

Method	PDB	RMSD (Å)	Residues matched (out of 222)
STEEP (Cα atoms)	1PPF	1	170
	2ALP	1.1	97
	1SGT	1.1	170
	2SGA	1.2	97
	3EST	0.9	187
	3RP2	1.1	179
	1TPP	1.2	160
STEEP (reactive atoms)	1PPF	1.4	89
	2ALP	1.5	39
	1SGT	1.4	62
	2SGA	1.4	39
	3EST	1.5	51
	3RP2	1.5	48
	1TPP	0.5	206
MUSTANG	1PPF	0.9	176
	2ALP	1.3	96
	1SGT	0.9	177
	2SGA	1.3	100
	3EST	0.9	187
	3RP2	0.9	182

The RMSD values obtained for superimposing one protein (PDBid:1TGS, 222 amino acids) with all other proteins are shown. Cα atoms that are within 2 Å of each other are considered to be equivalent. The number of residues matched is another important metric, since an inferior superimposition might have an equivalent RMSD, but align fewer residues. It is seen that when each amino acid is represented by the Cα atom rather than the reactive atom STEEP results in much smaller RMSD and more equivalent residues.

To summarize, we show that STEEP generates phylogenies similar to those obtained from sequence alignment (ClustalW) and structural alignment (MUSTANG) programs by considering residues in the vicinity of the active site, and also generates a superimposition comparable to the one generated by MUSTANG by simply aligning the active site residues.

Serine and metallo-β-lactamase superfamilies

β-lactamases inactivate antibiotics by hydrolyzing the amide bond of the β-lactam ring. The Ambler classification has four classes: classes A, C, and D have a nucleophilic serine at the active site (SBL), while MBLs or class B β-lactamases are metallo-enzymes requiring zinc for their activity, and have been further divided into three subgroups (B1, B2, and B3) based on sequence homology. SBLs are characterized by 3 conserved motifs [SXXK, (S/Y)X(N/V) and K(T/S)G]. We constructed the active site motif (Ser70, Lys73, Ser130, Lys234) by choosing at least 1 residue from each of the 3 motifs from a class A SBL (PDBid:1E25). While searching for matches, Ser130 was matched with either Ser or Tyr to accommodate the variability seen in various SBLs. The set of proteins analyzed consisted of three structures from each of the classes A, C and D of SBLs and penicillin binding proteins (PBP), and two structures from signal transduction proteins (Tables S3 and S4). CLASP queried the set of proteins using the active site motif, and detected significantly congruent scaffolds in each of the proteins (Table 3). Thus, these residues represent a structural profile for the serine β-lactamase superfamily.
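The three conserved SBL sequence motifs quoted above translate directly into regular expressions, with X read as any residue. The short Python sketch below only illustrates that notation; it is not how CLASP finds the active site, which relies on spatial and electrostatic congruence rather than sequence patterns. The function name and the test sequence are made up for the example.

```python
import re

# Regular-expression forms of the motifs SXXK, (S/Y)X(N/V) and K(T/S)G.
SBL_MOTIFS = {
    "SXXK": r"S..K",
    "(S/Y)X(N/V)": r"[SY].[NV]",
    "K(T/S)G": r"K[TS]G",
}

def find_motifs(sequence, motifs=SBL_MOTIFS):
    """Return {motif_name: [0-based start positions]} for a one-letter sequence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    hits = {}
    for name, pattern in motifs.items():
        hits[name] = [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]
    return hits

# Made-up sequence containing one occurrence of each motif.
toy_hits = find_motifs("MSTAKLLSDNAAKTGV")
```

Patterns this short occur by chance in most proteins, so sequence hits of this kind would need to be checked against the structure before being treated as an active site.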

Table 3. Potential and spatial congruence of the active site residues in proteins from the Serine β-lactamase superfamily

PDB	Active site atoms (a,b,c,d)		ab	ac	ad	bc	bd	cd
1E25	Ser70OG,Lys73NZ,Ser130OG,Lys234NZ,	D	2.8	3.2	4.7	3.6	5.6	2.9
	Class A Serine β-lactamase	PD	-125.6	22.4	-189.1	148.1	-63.5	-211.5
1I2S	Ser70OG,Lys73NZ,Ser130OG,Lys234NZ,	D	2.7	3.2	4.5	3.1	5	2.8
	Class A Serine β-lactamase	PD	-166.4	-35.5	-219.5	130.9	-53.1	-184
1BSG	Ser70OG,Lys73NZ,Ser130OG,Lys234NZ,	D	2.8	3.4	4.7	3.3	5.3	2.9
	Class A Serine β-lactamase	PD	-178.3	-31.4	-188.6	146.8	-10.3	-157.1
2WZX	Ser90OG,Lys93NZ,Tyr177OH,Lys342NZ,	D	3.5	3	4.5	2.6	5	2.8
	Class C Serine β-lactamase	PD	-161.5	-56.9	-153.1	104.6	8.4	-96.2
1KE4	Ser64OG,Lys67NZ,Tyr150OH,Lys315NZ,	D	2.9	3	4.6	3.4	5.6	2.8
	Class C Serine β-lactamase	PD	-228	-10.4	-187.1	217.6	40.9	-176.7
1FR6	Ser64OG,Lys67NZ,Tyr150OH,Lys315NZ,	D	2.9	3.3	4.6	2.4	5	3.1
	Class C Serine β-lactamase	PD	-132.1	-18.4	-164.2	113.7	-32.1	-145.8
1K57	Ser67OG,Lys70NZ,Ser115OG,Lys205NZ,	D	2.8	2.6	4.7	3.1	5.6	3.8
	Class D Serine β-lactamase	PD	-162.9	51.7	-184.7	214.6	-21.8	-236.4
3ISG	Ser67OG,Lys70NZ,Ser115OG,Lys212NZ,	D	3.3	3.9	4.3	4.8	5.3	2.2
	Class D Serine β-lactamase	PD	-246.3	-13.9	-231.5	232.4	14.8	-217.7
1K38	Ser67OG,Lys70NZ,Ser115OG,Lys205NZ,	D	3.1	3.7	4.9	4.7	5.6	2.7
	Class D Serine β-lactamase	PD	-292.5	-50.7	-309.8	241.8	-17.3	-259.1
1QME	Ser337OG,Lys340NZ,Ser395OG,Lys547NZ,	D	2.9	3.2	4.5	2.7	5	3
	Penicillin binding protein	PD	-211.5	-38.2	-242	173.3	-30.5	-203.8
1NZO	Ser44OG,Lys47NZ,Ser110OG,Lys213NZ,	D	3.1	4.2	6.3	5.1	6.8	2.7
	Penicillin binding protein	PD	-241.6	-68.8	-277.9	172.8	-36.2	-209.1
2EX2	Ser62OG,Lys65NZ,Ser306OG,Lys417NZ,	D	2.9	3	4.3	3.3	5	2.9
	Penicillin binding protein	PD	-213.6	-84	-264.8	129.6	-51.2	-180.8
1XA1	Ser59OG,Lys62NZ,Ser107OG,Lys196NZ,	D	2.6	3.5	4.7	3.8	5.8	2.9
	Signal transducer BlaR1	PD	-126.2	73.7	-175.8	199.9	-49.6	-249.5
1NRF	SER402OG,LYS405NZ,SER450OG,LYS539NZ,	D	2.7	3.6	4.7	4.9	6.1	2.8
	Signal transducer BlaR1	PD	-249.6	2.1	-217.7	251.7	31.9	-219.8

D = Pairwise distance in Å. PD = Pairwise potential difference. The electrostatic potential is in dimensionless units of kT/e where k is Boltzmann's constant, T is the temperature in K and e is the charge of an electron.

In MBLs, classes B1 and B3 possess a binuclear active site that requires 1 or 2 Zn2+ ions (Zn1 and Zn2 site) for full activity. Subclass B2 enzymes are catalytically active with one Zn2+ ion, while the binding of the second zinc ion has been shown to have inhibitory effects. The active site profile for MBLs was created from 2 residues each from the Zn1 (His118 and His196) and Zn2 (Asp120 and His263) ligands. The set of proteins analyzed consisted of 3 structures from each of the classes B1 and B3, a structure each from the class B2, glyoxalase II and a methyl parathion hydrolase (Tables S5 and S6). As expected, we detected significantly congruent scaffolds in each of the proteins (Table 4). It is this spatial and electrostatic congruence that has been used to identify a scaffold recognizing a β-lactam (imipenem) in a cold-active Vibrio alkaline phosphatase.

Table 4. Potential and spatial congruence of the active site residues in proteins from the metallo-β-lactamase superfamily

PDB	Active site atoms (a,b,c,d)		ab	ac	ad	bc	bd	cd
1ZNB	HIS101NE2,ASP103OD1,HIS162NE2,HIS223NE2,	D	6.3	4.9	9.1	5.8	4.7	6
	Class B1	PD	124.5	152.2	168.3	27.7	43.8	16.1
1DD6	HIS79NE2,ASP81OD1,HIS139NE2,HIS197NE2,	D	6.8	5	9.4	6.1	5.2	6
	Class B1	PD	97	98.6	47.3	1.6	-49.7	-51.3
1M2X	HIS118NE2,ASP120OD1,HIS196NE2,HIS263NE2,	D	6.9	5	9.3	6.1	4.9	6.1
	Class B1	PD	59.9	100.8	6.1	40.9	-53.8	-94.8
3F9O	HIS118NE2,ASP120OD1,HIS196NE2,HIS263NE2,	D	6.8	4.8	10	5.6	5	6.2
	Class B2	PD	-109.3	-180.8	-74.4	-71.6	34.8	106.4
1JT1	HIS118NE2,ASP120OD1,HIS196NE2,HIS263NE2,	D	8	4.5	9.4	6.6	3.1	6.5
	Class B3	PD	245.3	65	152.3	-180.2	-93	87.3
1SML	HIS86NE2,ASP88OD1,HIS160NE2,HIS225NE2,	D	6.3	4.7	9.3	6	4.9	6.1
	Class B3	PD	140.3	93.7	147.2	-46.6	6.9	53.5
3LVZ	HIS103NE2,ASP105OD1,HIS177NE2,HIS242NE2,	D	6.5	4.4	9.4	6.1	4.9	6.3
	Class B3	PD	131	110.6	104.5	-20.4	-26.5	-6.1
1QH5	HIS56NE2,ASP58OD1,HIS110NE2,HIS173NE2,	D	6.5	4.5	9.4	6.6	5	6.6
	glyoxalase II	PD	246.3	249.1	254.4	2.9	8.1	5.2
1P9E	HIS149NE2,ASP151OD1,HIS234NE2,HIS302NE2,	D	6.3	4.4	9.1	6.8	5.1	6.7
	methyl parathion hydrolase	PD	14.3	-72.8	27.4	-87	13.1	100.2

D = Pairwise distance in Å. PD = Pairwise potential difference. The electrostatic potential is in dimensionless units of kT/e where k is Boltzmann's constant, T is the temperature in K and e is the charge of an electron.

Figure 3A shows the superimposition of the three class A SBLs, Figure 3B shows the superimposition of one structure each of the classes (A, C, and D) SBLs, while Figure 3C shows the superimposition of a class A SBL, a PBP and a signal transduction protein. Likewise for the MBL superfamily, Figure 3D shows the superimposition of the three class B1 MBLs, Figure 3E shows the superimposition of one structure each of the classes (B1, B2 and B3) MBLs, while Figure 3F shows the superimposition of a class B3 MBL, a human glyoxalase II and a methyl parathion hydrolase.

Figure 3. Superimposing multiple proteins based on the homologous active site scaffolds for serine and metallo-β-lactamases (SBL, MBL). SBL motif = (Ser70, Lys73, Ser130, Lys234), MBL motif = (His118, His196, Asp120 and His263). Ser70 and His118 are colored black and are at the center of the coordinate axes (X = 0, Y = 0, Z = 0) for SBLs and MBLs, respectively. The proteins are colored red, yellow and blue respectively in order of appearance. (A) Three class A SBLs: PDBids:1E25, 1I2S and 1BSG. (B) A class A (PDBid:1E25), a class C (PDBid:1KE4) and a class D (PDBid:3ISG) SBL. (C) A class A SBL (PDBid:1E25), a penicillin binding protein (PDBid:1NZO) and a signal transducer BlaR1 protein (PDBid:1XA1). (D) Three class B1 MBLs: PDBids:1ZNB, 1DD6 and 1M2X. (E) A class B1 (PDBid:1ZNB), a class B2 (PDBid:3F9O) and a class B3 (PDBid:1JT1) MBL. (F) A class B3 MBL (PDBid:3LVZ), a human glyoxalase II (PDBid:1QH5) and a methyl parathion hydrolase (PDBid:1P9E).

It can be seen from these superimpositions that the active site shape is conserved, while the proteins accommodate much greater structural changes in the peripheral regions. The superimposition of all proteins is shown in Figure S1. It was noted here that aligning only three residues has the effect of aligning the complete protein, highlighting that the protein sequence accepts only those mutations that do not violate the conserved structure (and electrostatic properties) of the active site. Thus, it is only logical to compare these proteins based on the conserved residues in the vicinity of the active site. Figure 4A and Figure 4C in SBLs (Fig. 4E and G in MBLs) show the alignment and the cladogram, respectively, in the case when we ignore electrostatic congruence with the residues in the template protein. When potential difference congruence is used as a discriminator, we obtained the alignment shown in Figure 4B and the cladogram shown in Figure 4D for SBLs (Fig. 4F and H for MBLs).
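One simple way to bring several proteins into a common frame from three matched active site atoms, with the first atom at the origin as in Figure 3, is to build a local coordinate system on those atoms and express every other atom in it. The Python sketch below illustrates that idea only; the actual STEEP transformation is described in the Materials and Methods, which are not part of this section, and the choice of axes and all function names here are assumptions.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    norm = math.sqrt(dot(u, u))
    return tuple(a / norm for a in u)

def motif_frame(a, b, c):
    """Right-handed frame with atom a at the origin, b on +X and c in the XY plane.

    Fails with ZeroDivisionError if the three atoms coincide or are collinear.
    """
    ex = unit(sub(b, a))
    ez = unit(cross(ex, sub(c, a)))
    ey = cross(ez, ex)
    return a, (ex, ey, ez)

def to_frame(point, origin, axes):
    """Coordinates of `point` in the frame returned by motif_frame."""
    d = sub(point, origin)
    return tuple(dot(d, e) for e in axes)

# The same toy structure before and after a rigid move (rotation about Z plus
# a translation) gets identical coordinates in its own motif frame.
origin1, axes1 = motif_frame((1, 1, 1), (4, 1, 1), (1, 3, 1))
origin2, axes2 = motif_frame((9, -4, 3), (9, -1, 3), (7, -4, 3))
p1 = to_frame((1, 1, 5), origin1, axes1)
p2 = to_frame((9, -4, 7), origin2, axes2)
```

Because each protein is expressed relative to its own motif atoms, proteins with congruent motifs end up superimposed around the active site, while peripheral regions are free to differ.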

Figure 4. Multiple sequence alignments obtained using STEEP, and phylogenetic trees generated using PhyML for serine and metallo-β-lactamases (SBL, MBL). The active site motif is marked as `*'. The residues are within a radius of 9 Å from the specified active site residues. AS = alignment using spatial proximity. ASE = alignment using spatial proximity and electrostatic congruence. (A) AS for SBLs. (B) ASE for SBLs. (C) Cladogram generated from (A). (D) Cladogram generated from (B). (E) AS for MBLs. (F) ASE for MBLs. (G) Cladogram generated from (E). (H) Cladogram generated from (F).

It has been shown previously that class A and class D SBLs are sister taxa, and the divergence of the class C SBL predated the bifurcation of classes A and D SBLs. Figure 4C corroborates this hypothesis. Figure 4C also conforms to the known relationship between class D SBL and signal transduction proteins. Interestingly, the expected similarity in class A enzymes and some penicillin binding proteins (PDBid:1NZO) is not apparent from the cladogram. The deletion of a segment from the sequence close to the active site in these PBPs makes it difficult for even structural programs to identify such relationships. When we constrained the MSA using electrostatic congruence criteria, a different relationship emerged (Fig. 4D) which suggests that classes A and C SBLs are sister taxa. This dichotomy is explained by the fact that electrostatic homology often implies functional similarity: it is known that both classes A and C SBLs have the ability to hydrolyze cephalosporins, unlike class D SBLs which are specialized oxacillinases. Thus, Figure 4D ought to be interpreted as indicating functional relationship along with sequence/structural homology. A similar observation reveals PBPs and signal transduction proteins closer in Figure 4D as compared with Figure 4C, highlighting their functional similarity, namely their inability to hydrolyze β-lactams.
It has been shown that the B3 subclass of MBLs is distinct from the B1/B2 subclass based on sequence alignment. Extending this work, it was proposed that functionality in B1/B2 evolved approximately one billion years ago, whereas subclass B3 evolved about two billion years ago before Gram-positive and Gram-negative eubacteria had diverged. The culmination of this work was achieved by applying structural methods to generate a phylogeny which corroborated the above hypotheses, and also included other proteins from the MBL superfamily (like human glyoxalase II and methyl parathion hydrolase). Furthermore, human glyoxalase II and methyl parathion hydrolase were shown to be closely related to subclass B3. Figure 4E, F, G and H demonstrate that the STEEP results corroborate these hypotheses. The MBLs show much more electrostatic homogeneity than SBLs in the related classes (Fig. 4F and H). The inhibitory effect of the binding of the second zinc ion in subclass B2 enzymes is also highlighted by the fact that Asn116 has spatial equivalence (Fig. 4E), but lacks electrostatic congruence with the corresponding histidine (His116) in the other subclasses B1 and B3 (Fig. 4F). The MUSTANG generated phylogenetic tree for SBLs did not reflect the accepted relationship in the superfamily (Fig. S2A), since class A and class C enzymes were seen to be sister taxa, rather than class A and class D enzymes. In fact, this relationship is similar to the functional relationship detected by STEEP by using electrostatic pruning (Fig. 4D). From the complete set used by STEEP, MUSTANG was unable to process one protein each from class A, PBP and signal transduction proteins. The MUSTANG inferred phylogeny in MBLs concurred with the accepted relationship, and with the one detected by STEEP (Fig. 4F; Fig. S2B). As can be done after any MSA, we extracted a profile from the STEEP generated MSA, and extended the initial profile provided as the input motif.
The extended profiles for the SBL and MBL superfamilies were created from Figure 4B and Figure 4F, respectively, by choosing columns that have less than 75% gaps, which in practice meant a residue present in at least 10 of the 14 SBL structures or at least 6 of the 9 MBL structures (Table 5). These extended profiles can be considered better representatives of the superfamilies.
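The column selection just described can be sketched in Python as a filter on the number of non-gap residues per alignment column. The function name and the input format (equal-length strings with `-` for gaps) are assumptions for illustration; the threshold is passed as a minimum count, matching the counts quoted in the text (10 for SBLs, 6 for MBLs), and the toy alignment is invented.

```python
def extend_profile(alignment, min_count):
    """Select alignment columns occupied in at least `min_count` sequences.

    alignment is a list of equal-length strings using '-' for gaps.
    Returns a list of (index, count, types): the 1-based column index, the
    number of sequences with a residue there, and the residue types seen,
    sorted and joined with '/'.
    """
    profile = []
    for col in range(len(alignment[0])):
        residues = [row[col] for row in alignment if row[col] != "-"]
        if len(residues) >= min_count:
            types = "/".join(sorted(set(residues)))
            profile.append((col + 1, len(residues), types))
    return profile

# Toy alignment of four sequences; the third column is mostly gaps.
toy_profile = extend_profile(["SK-G", "SKAG", "S--G", "TK-G"], min_count=3)
```

Each returned tuple corresponds to one row of a table laid out like Table 5 (index, count, amino acid types), although the residue ordering within a row is alphabetical here.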

Table 5. Extending the profile

SBL			MBL
Index	Count	Amino Acid Types	Index	Count	Amino Acid Types
1	10	(F/M/P/I/L)	2	6	(A/Y/E/V)
3	14	(S)	3	7	(S/W/D/P/V)
4	13	(F/T/L/V)	5	7	(H/N/I/Y/L/V)
6	14	(K)	8	7	(S/A/T/D/G)
7	14	(A/M/T/I/L/V)	12	6	(S/T/N)
8	13	(A/F/S/T/N/P/Y/I/L)	13	8	(H)
20	14	(S/Y)	14	8	(H/S/F/A/M/W)
22	10	(N)	15	9	(H)
23	13	(S/W/T/P/Y/V/M/C)	16	8	(A/F/S/W/D/P/G/L)
24	12	(F/S/A/I/G/Y/V)	17	9	(D)
30	11	(S/Q/M/D/K/P/Y/E)	19	6	(T/I/G)
37	14	(K)	20	9	(A/R/P/G)
38	13	(S/T)	22	8	(W/I/L/V)
39	14	(G)	29	6	(F/M/Y/L)
40	10	(F/S/A/T/R)	31	9	(H)
41	12	(A/S/T/E/H/Q/R/I)	32	8	(S/T/D)
42	12	(A/S/W/N/G/L/Y)	36	8	(H/T/D/N/C)
			37	8	(S/D/C)
			38	6	(T/M/I/G/L)
			39	6	(S/T/K/G/L)
			41	8	(A/T/N/P/Y)
			43	8	(D/N/L/Y/E)
			47	8	(A/D/L/Y)
			49	8	(H)

Consensus residues in the SBL and MBL superfamily with respect to spatial location and electrostatic properties. Indexing is with reference to the sequence alignment shown in Figure 4A and Figure 4E for SBL and MBL, respectively. Count is the number of proteins which have a certain amino acid in that index in the alignment. The profile is extended if there are less than 75% gaps. For SBLs, the complete set has 14 proteins, so the required count is 10. For MBLs, the complete set has 9 proteins, so the required count is 6.

It is possible to identify residues that lack electrostatic congruence by comparing these two alignments, and these residues can be subjected to site directed mutagenesis techniques designed to mirror the specificity of the desired protein. Previously, we have noted that Leu153 is the best candidate for mimicking Glu166 when we superimposed the class A SBL and the PBP-5 (PDBid:1NZO). We proposed that the L153E PBP-5 mutant might provide greater success in replicating β-lactamase enzymatic efficiency in PBPs than achieved through a similar mutation in a PBP-A from T. elongatus. Figure 4A corroborates the Leu153-Glu166 spatial equivalence, while Figure 4B shows that there is no electrostatic congruence in these 2 residues. Another observation is that the spatially equivalent Trp154 from class D SBLs and signal transduction protein, and Glu166 in class A SBLs (Fig. 4A) also lack electrostatic similarity (Fig. 4B), although both of them are critical for catalysis.
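The comparison of the two alignments described above amounts to listing positions where a residue is aligned under spatial constraints alone but becomes a gap once electrostatic congruence is required. A minimal Python sketch follows, assuming (hypothetically) that both alignments are stored as dictionaries of gapped strings sharing the same column indexing; the function name and the three-column toy alignments are illustrative and are not the alignments of Figure 4.

```python
def lost_congruence(spatial, electrostatic):
    """List residues aligned spatially but dropped under electrostatic pruning.

    Both arguments map a protein name to its gapped sequence ('-' for gaps),
    over the same columns. Returns (protein, 1-based column, residue) tuples.
    """
    flagged = []
    for name, aligned in spatial.items():
        pruned = electrostatic[name]
        for col, (res_s, res_e) in enumerate(zip(aligned, pruned), start=1):
            if res_s != "-" and res_e == "-":
                flagged.append((name, col, res_s))
    return flagged

# Toy example: the second column pairs E (template) with W (other protein)
# spatially, but the W is pruned when electrostatic congruence is required.
spatial_alignment = {"template": "SEK", "other": "SWK"}
electrostatic_alignment = {"template": "SEK", "other": "S-K"}
candidates = lost_congruence(spatial_alignment, electrostatic_alignment)
```

Each flagged residue, read together with the template residue in the same column, suggests a substitution to test, in the spirit of the Leu153 to Glu proposal discussed in the text.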

Discussion

We present a 3-dimensional structure-based method for generating a multiple sequence alignment (MSA) of a set of proteins which additionally incorporates electrostatic properties of the residues in the matching algorithm (STEEP). STEEP requires that the proteins have known structures, and at least 1 of the proteins has known active site residues. This active site motif extracts a structural profile that is then used for the comparison of proteins (e.g., in data banks). The congruence in cognate pairs, seen across various structures within the same protein superfamily (Tables 1, 3, and 4), is non-trivial and is an innate property of the enzymatic function. Subsequently, we applied geometrical transformations to generate a multiple superimposition of the proteins and emit an MSA based on a parameterized distance from the catalytic site. The matching algorithm can either include or exclude electrostatic considerations, resulting in two distinct alignments. Such a technique, applied to distantly related proteins, gives better results when confined to the active site and its close neighborhood rather than including all the residues. We have shown that the chosen distance of 9 Å, which typically includes 30–50 residues, gives a phylogeny equivalent to that determined from a larger number of residues. This is because the active site and its vicinity have the highest 'inertia' when it comes to mutations, and thus preserve the most information about lineage. A comparison of the alignments that either include or exclude electrostatic congruence provides key residues that are possible candidates for mutations intending to transfer the functionality of 1 protein into another by directed evolution strategies. STEEP can easily be incorporated in the PALI database which provides structure-based sequence alignments for homologous proteins with known structures.
Currently, the PALI database uses DALI to implement pairwise superimpositions and MUSTANG to superimpose multiple structures. Incorporating STEEP would extend the database to display functional relationships in the homologous protein sets. Proteases have evolved to use different mechanisms for proteolysis. Serine proteases, the most abundant class, cut peptide bonds in proteins using a well-known catalytic triad (His57, Asp102, Ser195). Though His57, Asp102 and Ser195 are far apart in their primary sequence, they converge in the 3D structure to form the active site. We have compared the results obtained from STEEP to those obtained from a sequence based MSA program (ClustalW) and a structure based alignment program (MUSTANG) for a set of proteins from the chymotrypsin superfamily. While the MUSTANG superimposition of the complete proteins was superior to the one generated by STEEP (Fig. 1), it should be noted that the dispersion in the active site residues in the STEEP superimposition is less. Since STEEP generates the MSA based on the residues in the vicinity of the active site, the alignments are based on a better spatial overlap. The results obtained from all three programs generated almost equivalent phylogenies (Fig. 2). The evolution of β-lactamases and the prevalence of antibiotic resistance is the subject of intense research and speculation. We have applied STEEP to generate the phylogenetic trees for the serine (SBL) and metallo-β-lactamase (MBL) superfamilies. These relationships have been studied previously. Class C SBLs were hypothesized to have evolved separately from class A or class D proteins. Also, structural comparison has revealed a common fold between signal transduction proteins and class D enzymes. For MBLs, the B3 subclass has been shown to have an independent origin as compared with the B1 and B2 subclasses, based on both sequence and structural phylogeny.
Other proteins from the MBL superfamily, such as human glyoxalase II and methyl parathion hydrolase, are more closely related to the class B3 enzymes. Here we confirm the expected structural homologies in these proteins (Fig. 3), and our results concur with the hypothesized relationships in these sets of proteins (Fig. 4). However, the cladogram did not indicate the expected similarity between class A enzymes and certain penicillin binding proteins (PDBid:1NZ0), possibly due to the deletion of a segment from the sequence close to the active site. Since sequence-based MSA techniques are not applicable to the serine and metallo-β-lactamase superfamilies due to significant sequence divergence, we compared STEEP results for these two superfamilies to those generated by MUSTANG. While both methods agree on the MBL phylogenies, the cladogram generated for SBLs by MUSTANG differed from the STEEP-generated cladogram (the latter showing the accepted relationship).

Previous work by various groups has elucidated the discriminating power of electrostatic properties and proposed methods for identifying residues that determine the specificity of homologous proteins from different species. For example, the molecular dipole of the binding site of a ligand-free structure has been used to discriminate between adenine and guanine binding sites in proteins. Another work applied electrostatic similarity indices to 100 members of the Pleckstrin homology (PH) domain family, and demonstrated that the “electrostatic properties of the PH domains are generally conserved despite the extreme sequence divergence”. Such conservation in protein superfamilies has been established by other groups as well. The electrostatic similarity index has also been applied to identify residues that are responsible for differing selectivity in the dihydrofolate reductase protein taken from different species (PIPSA). This feature is similar to the one we have described in the current work.
Both STEEP and PIPSA, however, depend on obtaining a relevant superimposition of the target protein. A superimposition-independent method has been proposed to functionally classify protein structures based on properties that are invariant under affine transformations. Such a method is particularly applicable to cases where there is little global similarity.

The inability of CLASP to distinguish between mirror images is typical of methods that use RMSD. The symmetry in the potential difference in the active site of the β-lactamase superfamily is highlighted in Table S7. Here, the mirror images of the correct scaffold had marginally better CLASP scores. We filtered out such images by ensuring the proper sequence order between the querying motif and the matched residues.

A caveat in inferring phylogenetic relationships through structural similarity is the phenomenon of convergent evolution, which achieves the same fold through a different evolutionary pathway. The presence of a convergently evolved protein in the set might result in the detection of a homologous scaffold, but subsequently produce irrelevant results. This limitation is shared by almost all programs that generate an MSA of proteins, and thus manual inspection is required to prune out unrelated proteins.

Another limitation of STEEP (and of other structure-only methods) when compared with sequence-based MSA methods is the requirement that the structure of the protein be known beforehand. One can use a structure prediction method to generate a likely structure to circumvent this limitation. However, the accuracy of the tool used to predict the structure needs to be kept in mind when assessing the results of such a workflow. A quantitative comparison with such structural methods is made difficult by the lack of good metrics for benchmarking structural alignments, although a recent method proposes a mathematical framework for protein structure comparison.
It should also be noted that the STEEP methodology anchors the superimposition on the residues in the active site, a demanding constraint that leads to non-optimal results as the distance from the active site increases. Thus, it does not fare as well as other methods (MUSTANG, as compared using RMSD values) that apply global and flexible constraints while superimposing, although the results improve considerably when amino acids are represented by the Cα atom instead of the reactive atom. The approach adopted by STEEP is necessary in order to ensure the optimal superimposition of the active site, even at the cost of non-optimal results in other domains. Doing so ensures that the residues proposed for mutation are identified correctly. Comparisons with sequence alignment methods suffer for the same reason, and also because the benchmarking suites have been shown to incorporate structural information inadequately. Finally, it should be noted that STEEP is less automated than the other methods compared in the current work (ClustalW and MUSTANG), and requires a priori knowledge of the active site residues. Any comparison metric that favors STEEP should take this into consideration.

Conclusions

To summarize, we propose an MSA methodology that generates both evolutionary and functional relationships, eliminates unrelated proteins from the computation, and emits a multiple superimposition of the related proteins. We also demonstrate that the active site vicinity contains enough information to infer the correct kinship in a set of related proteins. A unique feature of STEEP is the ability to identify residues that can be targeted by directed evolution experiments in order to endow one member of a superfamily with the enzymatic functionality of another.

Materials and Methods

STEEP takes a set of related proteins with known structures, such that the catalytic site is known for at least one protein (the template protein) and the chemically active site consists of three or more residues. These residues are used to create a query motif for analysis using CLASP. The underlying theoretical foundation for CLASP is the non-triviality of the spatial and electrostatic congruence in cognate pairs seen across various structures with the same catalytic function. CLASP extracts matching scaffolds in these related proteins, which are then superimposed. Thus, we obtain a consolidated spatial reference frame for the set of proteins. We then align the residues from the template protein that lie within a certain (parameterized) radius of the residues in the active site motif with spatially close residues in the other proteins, giving the user the option to use electrostatic congruence as a discriminator. These steps are now described in detail.

Extracting the partial scaffolds

STEEP takes as input a set of M related proteins (Eq. 1) with known structures and a motif consisting of N (≥ 3) residues (Eq. 2) from the catalytic site of one of the proteins (P1). Every amino acid is represented by a user-defined atom. This is the atom whose electrostatic potential is taken to be representative of that particular residue in the protein, just as the Cα atom represents the spatial coordinates in an RMSD analysis. Also, each position of the motif has a set of amino acids (Eq. 3) specified to allow for stereochemically equivalent matches at that particular position, such that an amino acid matched to the motif residue Ri must belong to the group specified for position i. All sets of N residues satisfying these constraints are obtained in each protein Pi using an exhaustive search procedure similar to the one used in SPASM (Eq. 4). The pairwise distances and potential differences are computed in each match MatchPij for each protein Pi (i ≠ 1), and are then compared with those of the active site motif ΦP1ASM using a scoring function (Cscore), resulting in a score that defines an ordering of the matches, with lower scores indicating better matches. Matches scoring above a user-defined threshold (Sthresh) are discarded. If even the best match has a score of more than Sthresh, the protein is discarded under the assumption that it is not related to the other proteins. The scaffold for a protein Pi is defined as the match with the lowest Cscore, MatchPi1. The pseudocode for this function is shown in Figure S3A.
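The search described above can be illustrated with a minimal Python sketch. It is not the STEEP or CLASP implementation (which is written in Perl), and several pieces are assumptions made only for illustration: the residue dictionaries, the function names `toy_score` and `find_scaffold`, and in particular the scoring formula, which is a simple stand-in because the published Cscore is not reproduced in this section. The sketch only shows the overall shape of the procedure: enumerate every residue combination allowed by the per-position amino acid groups, score each against the motif using pairwise distances and potential differences, keep the lowest score, and discard the protein when even the best score exceeds the threshold.

```python
import math
from itertools import combinations, product


def pairwise(values, fn):
    """Apply fn to every pair (i < j) of an ordered sequence."""
    return [fn(values[i], values[j]) for i, j in combinations(range(len(values)), 2)]


def toy_score(motif, match):
    """Hypothetical stand-in for Cscore (NOT the published function).

    Sums the absolute deviations in pairwise distances and in pairwise
    potential differences between the motif and a candidate match, with
    equal (arbitrary) weights. Lower is better; 0 is perfect congruence.
    """
    d_ref = pairwise([r["xyz"] for r in motif], math.dist)
    d_new = pairwise([r["xyz"] for r in match], math.dist)
    p_ref = pairwise([r["pot"] for r in motif], lambda a, b: a - b)
    p_new = pairwise([r["pot"] for r in match], lambda a, b: a - b)
    return (sum(abs(a - b) for a, b in zip(d_ref, d_new))
            + sum(abs(a - b) for a, b in zip(p_ref, p_new)))


def find_scaffold(motif, groups, protein, s_thresh):
    """Exhaustively search `protein` for the best-scoring match to `motif`.

    groups[i] is the set of amino acids allowed at motif position i.
    Returns (score, match), or None when the protein should be discarded.
    """
    candidates = [[r for r in protein if r["aa"] in g] for g in groups]
    best = None
    for combo in product(*candidates):
        if len({r["id"] for r in combo}) < len(combo):
            continue  # one residue cannot fill two motif positions
        score = toy_score(motif, combo)
        if best is None or score < best[0]:
            best = (score, combo)
    if best is None or best[0] > s_thresh:
        return None  # no acceptable match: treat the protein as unrelated
    return best


# Toy data (invented coordinates and potentials, for illustration only).
motif = [
    {"id": 57, "aa": "H", "xyz": (0.0, 0.0, 0.0), "pot": -100.0},
    {"id": 102, "aa": "D", "xyz": (3.0, 0.0, 0.0), "pot": -250.0},
    {"id": 195, "aa": "S", "xyz": (0.0, 4.0, 0.0), "pot": -50.0},
]
groups = [{"H"}, {"D", "E"}, {"S"}]
protein = [
    {"id": 40, "aa": "H", "xyz": (10.0, 10.0, 10.0), "pot": -90.0},
    {"id": 85, "aa": "E", "xyz": (13.0, 10.0, 10.0), "pot": -240.0},
    {"id": 177, "aa": "S", "xyz": (10.0, 14.0, 10.0), "pot": -40.0},
    {"id": 12, "aa": "S", "xyz": (30.0, 30.0, 30.0), "pot": 0.0},
]
result = find_scaffold(motif, groups, protein, s_thresh=10.0)
```

In this toy case the residues 40, 85 and 177 reproduce the motif's internal distances and potential differences exactly, so they are returned as the scaffold, while the distant serine 12 is rejected. Because only internal (pairwise) quantities are compared, the search needs no prior superimposition.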

Superimposing the scaffolds

The scaffolds from all M proteins are now superimposed by extending the technique described previously for a pair of proteins to multiple proteins. In order to superimpose two scaffolds, MatchP11 and MatchPi1, we apply both linear (translational) and rotational transformations to all atoms in P1 and Pi such that the first three atoms {a1, a2, a3} in MatchP11 and MatchPi1 lie on the same plane (Z = 0), the a1 atoms are at the origin, and the a2 atoms lie on the Y axis. We iterate the pairwise superimposition of the template protein with each of the other proteins to obtain a multiple superimposition. This superimposition is written out as a PyMOL-formatted file, and can be viewed using PyMOL. The set of proteins now has a consolidated spatial reference frame. The pseudocode for this function is shown in Figure S3B.
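The transformation described above amounts to expressing every atom in a coordinate frame built from the three scaffold atoms. The following Python sketch shows one way to construct such a frame; the function names (`canonical_frame`, `to_frame`) and the toy coordinates are hypothetical, and the actual STEEP code may order its operations differently. The frame is right-handed, so the mapping is a pure translation plus rotation with no reflection.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def canonical_frame(a1, a2, a3):
    """Frame with a1 at the origin, a2 on the Y axis and a3 in the plane Z = 0.

    The three atoms must not be collinear (otherwise a division by zero occurs).
    """
    ey = unit(sub(a2, a1))                      # Y axis points from a1 to a2
    v = sub(a3, a1)
    ex = unit(sub(v, tuple(dot(v, ey) * c for c in ey)))  # part of a3 orthogonal to Y
    ez = cross(ex, ey)                          # completes a right-handed frame
    return a1, (ex, ey, ez)


def to_frame(p, frame):
    """Coordinates of point p in the given frame (translation, then rotation)."""
    origin, axes = frame
    d = sub(p, origin)
    return tuple(dot(d, axis) for axis in axes)


# Two toy scaffolds with the same internal geometry but different poses.
a = [(1.0, 2.0, 3.0), (4.0, 6.0, 3.0), (1.0, 2.0, 7.0)]
b = [(10.0, 0.0, 0.0), (10.0, 5.0, 0.0), (10.0, 0.0, 4.0)]
frame_a = canonical_frame(*a)
frame_b = canonical_frame(*b)

# Any other atom of each protein is mapped with the frame of its own scaffold.
extra_a = (-0.6, 3.2, 3.0)
extra_b = (8.0, 0.0, 0.0)
```

Applying `to_frame` with each protein's own frame places all proteins in the same reference frame, which is what makes the iterated pairwise procedure a multiple superimposition. In the toy data both scaffolds map onto (0, 0, 0), (0, 5, 0) and (4, 0, 0), and the two extra atoms land on the same point.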

Generating the alignment, and proposing mutations

Finally, we align the residues from the template protein that lie within a certain (parameterized) radius Rdist of the active site residues. The set of these residues is ΦP1align (Eq. 5). The choice of the radial distance that encompasses interacting residues has to be evaluated based on the enzymes being investigated. A small radius will not include enough residues, while a large one will include irrelevant ones. We have seen that any distance greater than 4 Å gives comparable cladograms (Fig. S4). The residues in the template protein within a distance of 9 Å constitute the sequence that is used for alignment in all the examples in the current work. Next, for each protein Pi (i ≠ 1), we identify residues that are in the vicinity of each of the residues in ΦP1align, choosing the closest residue as the aligned one (Eq. 6). This is possible since we have a consolidated spatial reference frame for the set of proteins (NResPi = number of residues in protein Pi). At this stage, we provide the option to use electrostatic congruence as a discriminator (Eq. 7). The subroutine potcon() evaluates whether the two atoms have potential congruence. Two kinds of alignments are obtained, by either ignoring potential congruence or filtering out residues that do not have electrostatic congruence. The pseudocode for this function is shown in Figure S3C. A comparison of these alignments identifies residues that lack electrostatic congruence even though they occupy a spatially equivalent position in the structure. These can be the basis of mutations in directed evolution experiments designed to mirror the desired protein and its functionality/specificity.
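The alignment step can be sketched as follows, assuming all proteins have already been placed in the common reference frame. This is an illustrative Python sketch rather than the STEEP code: the body of `potcon` here is a hypothetical tolerance test (the published criterion of Eq. 7 is not reproduced in this section), and the parameters `max_gap` and `tol`, the gap character, and the choice to emit a gap for a non-congruent residue are all assumptions.

```python
import math


def select_template_residues(template, motif_ids, r_dist=9.0):
    """Template residues within r_dist (Å) of any active site motif residue."""
    motif_xyz = [r["xyz"] for r in template if r["id"] in motif_ids]
    return [r for r in template
            if any(math.dist(r["xyz"], m) <= r_dist for m in motif_xyz)]


def potcon(a, b, tol=50.0):
    """Hypothetical congruence test: potentials agree within an arbitrary tolerance."""
    return abs(a["pot"] - b["pot"]) <= tol


def align_to_template(selected, protein, max_gap=2.0, use_potential=False):
    """One alignment row: the closest residue of `protein` for each template residue.

    A gap ('-') is emitted when nothing lies within max_gap (Å) or, if
    use_potential is set, when the closest residue fails potcon().
    """
    row = []
    for ref in selected:
        closest = min(protein, key=lambda r: math.dist(r["xyz"], ref["xyz"]),
                      default=None)
        ok = closest is not None and math.dist(closest["xyz"], ref["xyz"]) <= max_gap
        if ok and use_potential and not potcon(ref, closest):
            ok = False
        row.append(closest["aa"] if ok else "-")
    return "".join(row)


def candidate_positions(row_spatial, row_electro):
    """Alignment columns that change once electrostatic congruence is required."""
    return [i for i, (s, e) in enumerate(zip(row_spatial, row_electro)) if s != e]


# Toy data in a shared reference frame (invented values, for illustration only).
template = [
    {"id": 1, "aa": "H", "xyz": (0.0, 0.0, 0.0), "pot": -100.0},
    {"id": 2, "aa": "D", "xyz": (3.0, 0.0, 0.0), "pot": -250.0},
    {"id": 3, "aa": "S", "xyz": (0.0, 4.0, 0.0), "pot": -50.0},
    {"id": 4, "aa": "G", "xyz": (0.0, 0.0, 5.0), "pot": 10.0},
    {"id": 5, "aa": "K", "xyz": (20.0, 0.0, 0.0), "pot": 100.0},
]
other = [
    {"id": 11, "aa": "H", "xyz": (0.2, 0.0, 0.0), "pot": -95.0},
    {"id": 12, "aa": "E", "xyz": (3.1, 0.0, 0.0), "pot": -240.0},
    {"id": 13, "aa": "S", "xyz": (0.0, 4.1, 0.0), "pot": -60.0},
    {"id": 14, "aa": "A", "xyz": (0.0, 0.0, 5.3), "pot": 200.0},
]
selected = select_template_residues(template, motif_ids={1, 2, 3})
row_spatial = align_to_template(selected, other)
row_electro = align_to_template(selected, other, use_potential=True)
targets = candidate_positions(row_spatial, row_electro)
```

In the toy data the lysine at 20 Å falls outside the 9 Å radius, the purely spatial row reads `HESA`, and the row that requires congruence reads `HES-`. The single differing column marks a residue that is spatially equivalent but electrostatically different, which is the kind of position the text proposes as a mutation target.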

Implementation details and third party software

The STEEP package is written in Perl on Ubuntu. Hardware requirements are modest: all results here were obtained on a simple workstation (2 GB RAM), and runtimes were a few minutes at most. The source code and manual are made available at www.sanchak.com/steep.html. The Adaptive Poisson-Boltzmann Solver (APBS) and PDB2PQR packages were used to calculate the potential difference between the reactive atoms of the corresponding proteins. The APBS parameters and electrostatic potential units were set as described previously. The invariance in the electrostatic features (measured in structures that have been solved independently over many years) also speaks highly of the reliability of the APBS/PDB2PQR implementation. All protein structures were rendered with PyMOL (http://www.pymol.org/). The alignment and cladogram images were generated using SeaView. We have used PhyML, which uses the method of maximum likelihood, to generate phylogenetic trees from these alignments. The method searches for the tree with the highest probability or likelihood of giving rise to the observed data set, given a proposed model of evolution and the hypothesized history. The LG model, which provides the amino acid replacement matrices, is the chosen evolutionary model and is the default setting in PhyML. Although such methods are computationally intensive, they are robust to the choice of the evolutionary model and outperform alternative techniques (parsimony or distance methods).

Availability of supporting data

The source code and manual are made available at www.sanchak.com/steep.html.
