
A contact energy function considering residue hydrophobic environment and its application in protein fold recognition.

Abstract

The three-dimensional (3D) structure prediction of proteins is an important task in bioinformatics. Finding energy functions that can better represent residue-residue and residue-solvent interactions is a crucial way to improve the prediction accuracy. The widely used contact energy functions mostly only consider the contact frequency between different types of residues; however, we find that the contact frequency also relates to the residue hydrophobic environment. Accordingly, we present an improved contact energy function to integrate the two factors, which can reflect the influence of hydrophobic interaction on the stabilization of protein 3D structure more effectively. Furthermore, a fold recognition (threading) approach based on this energy function is developed. The testing results obtained with 20 randomly selected proteins demonstrate that, compared with common contact energy functions, the proposed energy function can improve the accuracy of the fold template prediction from 20% to 50%, and can also improve the accuracy of the sequence-template alignment from 35% to 65%.

Substances: Proteins

Year: 2005 PMID: 16689689 PMCID: PMC5172539 DOI: 10.1016/s1672-0229(05)03030-5

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Knowledge of protein structures plays a very important role in understanding protein functions, studying protein-protein interactions, reconstructing protein structures, and performing rational drug design. Protein structures can be determined by both experimental and computational methods. Experimental methods such as X-ray crystallography and nuclear magnetic resonance can determine the three-dimensional (3D) structure of proteins precisely; however, these methods are still inefficient and can be applied to only a small fraction of proteins. Computational methods, on the other hand, can in principle not only overcome the shortcomings of experimental methods, but also help in understanding the mechanism of protein folding. As a result, computational methods have been studied extensively and have become an effective way to analyze protein structures. Methods for predicting protein 3D structure can be divided into three main categories: homology modeling (6, 7), fold recognition (8, 9), and ab initio prediction. Homology modeling methods first search a database of proteins with known structures for homologs of the target protein, and then use the structures of these homologs as templates to build a structural model of the target protein. Fold recognition methods try to find a fold template for the target protein in a template library, and then construct a full structural model of the target protein based on the selected fold template. Ab initio prediction methods, which predict the structure of the target protein from its sequence information alone, calculate the energy of all possible conformations that the target sequence may fold into, and select the conformation with the lowest energy as the native conformation. In addition, protein structure prediction methods can also be classified into two classes according to the information they use. 
The first class uses information about known protein structures and evolution: it first searches for homologs of the target protein, and then builds a structural model of the target protein based on the structural information of these homologs. This class includes the above-mentioned homology modeling methods and the fold recognition methods based on PSI-BLAST or hidden Markov models (11, 12). The second class makes use of information about residue-residue and residue-solvent interactions: it first creates energy functions by statistical or theoretical analysis, and then uses these functions to search for the optimal structure in a structure template library or among all possible conformations of the target protein. The threading and ab initio prediction methods belong to this class. For the second class, it is therefore crucial that the energy functions describe residue-residue and residue-solvent interactions effectively. In threading methods, the widely used energy functions are obtained from statistical analysis, and most of them are based on the contact energy between residues (16, 17). These energy functions define pairwise residue contact energy scores according to the residue-residue contact frequencies observed in known protein structures. The basic idea of these energy functions was first proposed by Tanaka and Scheraga, and subsequent researchers have made various improvements, such as incorporating the hydrophobic properties of residues (19, 20), considering the orientation anisotropy of side chains, applying atom-level functions, and using more complicated (multibody) models. In addition, the performance of these energy functions in protein fold recognition has been evaluated in previous studies (24, 25, 26). 
The results indicate that energy functions based merely on the residue contact frequency are inexact, and one of the reasons might be that the influence of the hydrophobic interaction on contact energy has not been considered. When protein sequences fold into higher-order structures, the hydrophobic interaction is believed to be the dominant driving force: it drives hydrophobic residues into the core and leaves hydrophilic residues on the surface. This phenomenon indicates that the contact preference between residues depends to some extent on solvent molecules, and therefore the influence of the solvent environment (hydrophobic environment) should be considered when analyzing the pairwise residue contact energy. However, some of the existing pairwise energy functions take no account of the influence of the hydrophobic interaction on structural stability at all, while others consider the solvent influence only in terms of the hydrophobic properties of residues. For threading-based protein fold recognition, these functions cannot effectively reflect the influence of the hydrophobic environment on the residue contact energy. In this study, the preference of residues for the hydrophobic environment is analyzed by a statistical method, and an improved contact energy function that considers both the residue contact frequency and the residue hydrophobic environment is proposed, which reflects the influence of the hydrophobic interaction on protein structure stability more effectively. Furthermore, a fold recognition (threading) approach based on this energy function is developed, and the testing results demonstrate that the proposed function improves the accuracy of protein fold recognition compared with common contact energy functions.

Results

The contact energy considering residue hydrophobic environment

First, a dataset consisting of the structural information of 525 proteins was collected from the Protein Data Bank (PDB) for analyzing the residue contact energy and developing the energy function (see Materials and Methods). From this dataset, the coordinates of the Cβ atoms of all residues (Cα atoms for glycine) were obtained and used to evaluate the distance between residues. This distance was then used to determine whether two residues were in contact with each other. Next, the solvent accessible surface areas (SASAs) of the residues were obtained with the POPS program to determine the hydrophobic environment of each residue. Based on SASA, the hydrophobic environment was classified into three types: hydrophobic, hydrophilic, and neutral (uncertain). Finally, the information on residue distance and hydrophobic environment was integrated to determine the residue contact energy (see Materials and Methods). The contact energy was defined for each of the 400 possible residue pairs between the 20 types of amino acids. Furthermore, for each residue pair, there were nine possible combinations of residue hydrophobic environment. In total, 3,600 residue contact energy terms were determined. The contact energies related to asparagine and cysteine are shown in Figure 1.

Fig. 1

Examples of the relationship between the residue contact energy and the residue hydrophobic environment. The x-axis represents the twenty types of amino acids, and the y-axis represents the residue contact energy; different curves represent different hydrophobic environments. The letters H, N, and P denote hydrophobic, neutral (uncertain), and polar (hydrophilic), respectively. A. Residue contact energy related to asparagine (ASP). The contact energy of residue pairs containing asparagine tends to be high when the residue pairs are in the HH state (that is, both residues are in hydrophobic positions; the only exception is the pair with cysteine), but tends to be low in the HP or PH state. B. Residue contact energy related to cysteine (CYS). For some residue pairs consisting of cysteine and another residue such as cysteine, leucine, tryosin, or tyrosine, the contact energy is high when the residue pairs are in the PP state, which is distinct from the pattern for asparagine.

The statistical results indicate that the residue contact energy differs between residue pairs and between combinations of residue hydrophobic environments. Analysis of these results reveals that: (1) For most residue pairs, the contact energy scores tend to be large when both residues are in hydrophobic positions, but small when one is in a hydrophobic position and the other in a hydrophilic position. (2) The contact energy scores of different residue pairs differ considerably from each other. (3) Some residue pairs show distinctive behavior. For example, the contact frequencies between cysteine and other residues tend to be randomly distributed, but the contact probability is high when two cysteine residues are both in the hydrophilic environment (Figure 1B).
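
The size of the contact energy table follows directly from the counts given in this section: 20 × 20 residue pairs times 3 × 3 environment combinations. A minimal sketch of that enumeration is shown here; the one-letter residue codes and the key layout `(residue_i, residue_j, env_i, env_j)` are illustrative choices made for this sketch, not the authors' data format.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types (one-letter codes)
ENVIRONMENTS = "HNP"                  # hydrophobic, neutral (uncertain), polar (hydrophilic)


def energy_table_keys():
    """Enumerate every ordered (residue_i, residue_j, env_i, env_j) combination."""
    return [
        (a, b, env_a, env_b)
        for a, b in product(AMINO_ACIDS, repeat=2)          # 400 ordered residue pairs
        for env_a, env_b in product(ENVIRONMENTS, repeat=2)  # 9 environment combinations
    ]


keys = energy_table_keys()
print(len(keys))  # 3600
```

The cysteine-cysteine pair in the PP state discussed above corresponds to the key `("C", "C", "P", "P")` in this layout.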

The prediction accuracy of applying the contact energy to threading

In this study, the proposed contact energy was applied to protein fold recognition using the threading method, and a dataset consisting of 20 randomly selected proteins was used to test the prediction accuracy. The PDB identifiers of these proteins are shown in Table 1. For the purpose of comparison, the prediction accuracy of commonly used contact energy was also tested. Three measures, including self-template prediction accuracy, sequence-template alignment accuracy, and native alignment score, were used to evaluate the prediction performance.

Table 1

Features of the Twenty Testing Proteins

PDB ID	No. of α-helices	No. of β-sheets	No. of S–S bonds	Group
1ahu	3	0	0	Better
1cuk	3	0	0	Better
1hry	3	0	0	Better
1ahl	0	3	3	Better
1aw6	2	0	0	Better
1fre	1	2	0	Better
1r2a	2	0	0	Better
1lea	3	0	0	Better
1cbn	2	2	2	Better
1co4	2	2	0	Better
1auu	0	3	0	Better
1sei	1	3	0	Better
1nkl	5	0	3	Better
1neq	5	0	0	Better
1tpm	0	3	1	Equal
1ehs	2	0	2	Worse
1fd4	1	3	3	Worse
1mkn	0	3	3	Worse
1hyk	0	2	2	Worse
1chc	1	3	0	Worse

Self-template prediction accuracy

To evaluate the accuracy of self-template prediction, the z-scores of the alignments between the target protein sequence and each template in a template library were calculated and used to rank the templates; the position of the target's own template in this ranking reflects the accuracy of self-template prediction. The testing results indicate that, compared with the common energy function, the improved energy function performs better for 14 out of the 20 test proteins (70%). The percentages of testing proteins whose own templates are ranked within the top 10%, 25%, and 50% of all library templates are given in Table 2.

Table 2

Percentages of the Testing Proteins with Different z-score Ranks

Energy function	Top 10%	Top 25%	Top 50%
Common energy function	20%	55%	85%
Improved energy function	50%	80%	95%
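
The ranking behind Table 2 can be sketched as follows. The exact z-score definition used in the study is not reproduced in this text, so the sketch assumes the standard form (score minus library mean, divided by library standard deviation) and assumes that a higher alignment score means a better fit. The names `z_scores`, `rank_fraction`, and `share_within_top` are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Map each template name to the z-score of its alignment score."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {name: 0.0 for name in scores}
    return {name: (s - mu) / sigma for name, s in scores.items()}


def rank_fraction(scores, own_template):
    """Rank of the target's own template as a fraction of the library (1/N is best)."""
    z = z_scores(scores)
    ranked = sorted(z, key=z.get, reverse=True)  # assumption: higher score = better fit
    return (ranked.index(own_template) + 1) / len(ranked)


def share_within_top(fractions, cutoff):
    """Share of test proteins whose own template ranks within the top `cutoff` fraction."""
    return sum(f <= cutoff for f in fractions) / len(fractions)


scores = {"own": 5.0, "t2": 1.0, "t3": 3.0, "t4": 2.0}  # toy alignment scores
print(rank_fraction(scores, "own"))  # 0.25
```

Calling `share_within_top` with cutoffs 0.10, 0.25, and 0.50 on the rank fractions of all test proteins would give one row of a table shaped like Table 2.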

Sequence-structure alignment accuracy

The accuracy of the alignment between target sequences and their own structures, which measures whether the optimal alignments are consistent with the actual structures, was also used to evaluate fold recognition performance. To allow for small alignment errors, shifts from the exact alignment within four residues were counted as correct alignments. In this test, 7 out of 20 proteins (35%) were aligned correctly using the common energy function, compared with 65% (13 out of 20) using the improved energy function.
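
The shift tolerance can be illustrated with a small sketch. It assumes a hypothetical representation in which `alignment[i]` is the template position assigned to target residue `i` (or `None` for a gap), so that the exact self-alignment maps residue `i` to position `i`. Treating a protein as correctly aligned only when every aligned residue is within the tolerance is also an assumption of this sketch; the text does not state how per-residue shifts were aggregated.

```python
def residue_shifts(alignment):
    """Absolute shift of each aligned residue from its exact (native) position."""
    return [abs(pos - i) for i, pos in enumerate(alignment) if pos is not None]


def is_correct(alignment, tolerance=4):
    """True if every aligned residue lies within `tolerance` positions of the exact alignment."""
    shifts = residue_shifts(alignment)
    return bool(shifts) and max(shifts) <= tolerance


def alignment_accuracy(alignments, tolerance=4):
    """Fraction of proteins whose optimal alignment is counted as correct."""
    return sum(is_correct(a, tolerance) for a in alignments) / len(alignments)


exact = [0, 1, 2, 3]
shifted = [2, 3, 4, None, 8]   # shifts of 2, 2, 2 and 4; one gap
far_off = [5, 6, 7]            # every residue shifted by 5
print(alignment_accuracy([exact, shifted, far_off]))
```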

Native alignment score

The effectiveness of an energy function can also be evaluated by the difference in energy scores between the native sequence-template alignments and random alignments. The energy scores of the native alignments were obtained by aligning the residues of target sequences to their own positions in the templates, and the energy scores of random alignments were calculated by randomly aligning the residues of target sequences to the templates. For each target sequence, 1,000 random alignments were made, and their average score was used as the random-alignment score for that sequence. In this test, the native alignment scores of 75% of the proteins were higher than their average random-alignment scores when the improved energy function was used, whereas this figure was only 50% for the common energy function.
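
A minimal sketch of this comparison is given below. It assumes that a random alignment can be modeled by shuffling the target residues over the template sites, which is one simple reading of "randomly aligning" and not necessarily the procedure used in the study. The toy energy table, the contact list, and the function names are all hypothetical.

```python
import random


def alignment_score(residues, contacts, energy):
    """Sum the pair energies over all template site pairs that are in contact."""
    return sum(energy[(residues[i], residues[j])] for i, j in contacts)


def random_baseline(residues, contacts, energy, n_random=1000, seed=0):
    """Average score over `n_random` random placements of the residues on the sites."""
    rng = random.Random(seed)
    shuffled = list(residues)
    total = 0.0
    for _ in range(n_random):
        rng.shuffle(shuffled)  # assumption: random alignment = random permutation
        total += alignment_score(shuffled, contacts, energy)
    return total / n_random


# Toy example: two contacts on a four-site template, made-up pair energies.
energy = {("L", "L"): 1.0, ("K", "K"): 0.2, ("L", "K"): -0.5, ("K", "L"): -0.5}
contacts = [(0, 1), (2, 3)]
native = alignment_score("LLKK", contacts, energy)
baseline = random_baseline("LLKK", contacts, energy)
print(native > baseline)  # True
```

In the toy case the native placement pairs like with like, so its score (1.2) exceeds the average over shuffled placements.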

Discussion

The above testing results demonstrate that the contact energy function combined with the hydrophobic environment is superior to the common energy function, indicating that the hydrophobic environment not only relates to the residue contact energy, but also influences the accuracy of fold recognition. To analyze whether the prediction accuracy was correlated with protein structure features, we divided the 20 testing proteins into three groups, namely “Better”, “Equal”, and “Worse”, according to the template prediction accuracy. The “Better” group contains the proteins for which the template prediction accuracy was higher with the improved energy function than with the common energy function, the “Worse” group contains those for which it was lower, and the “Equal” group contains those for which it was the same. Then, the secondary structures and the number of disulfide bonds in these proteins were analyzed. The results are given in Table 1. As can be seen from Table 1, most of the proteins in the “Better” group belong to the α class or α/β class, while most of the proteins in the “Worse” group belong to the β class. In addition, the prediction accuracy also relates to the number of disulfide bonds in proteins. A possible explanation is as follows: when an α-helix forms its compact conformation, it is driven by outside forces such as the repulsion and attraction of solvent molecules, so the influence of the hydrophobic interaction is more significant for proteins consisting of α-helices; in contrast, because a β-sheet is not as tight as an α-helix, it may be stabilized by residue interactions, so it is unnecessary to consider the hydrophobic interaction.

However, the optimal alignment between target sequences and their own templates may not be exact, which might be caused by the following reasons: (1) The contact energy is derived from statistical analysis, which only reflects the statistically probable interactions between residues. (2) The optimal alignment is obtained by global alignment, which cannot guarantee that all local alignments are optimal. (3) For a specific residue, the residues around it may have properties similar to its own.

Materials and Methods

Dataset

A total of 525 proteins meeting the following criteria were selected from the PDB database for analyzing the contact energy function: (1) No two proteins share more than 30% sequence identity. (2) The structure was determined by X-ray crystallography. (3) The resolution is better than 2.0 Å. (4) The sequence consists of 30–750 amino acids.

Residue hydrophobic environment

The residue hydrophobic environment can be determined from the SASA of the residue, which is defined as the area traced out by the center of a solvent molecule as it rolls over the exposed surface of the residue in the solvent. A small SASA value means that the residue tends to be in the hydrophobic environment; otherwise, it tends to be in the hydrophilic environment. There are many programs for calculating SASA (30, 33), most of which are based on atom coordinates submitted by users. In this research, the freely available program POPS was used. Based on the SASA value, we classified the environment of the residues into three types: hydrophobic, hydrophilic, and neutral (uncertain).
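
The three-way classification can be sketched as a pair of thresholds on the per-residue SASA. The text does not give the cutoff values, so the thresholds `low` and `high` below are placeholders chosen only for illustration, and the function name is hypothetical.

```python
def classify_environment(sasa, low=20.0, high=60.0):
    """Classify a residue by its SASA (in Å²).

    `low` and `high` are placeholder thresholds, not values from the study:
    below `low` the residue is treated as buried (hydrophobic environment),
    above `high` as exposed (hydrophilic environment), otherwise as neutral.
    """
    if sasa < low:
        return "H"  # hydrophobic
    if sasa > high:
        return "P"  # polar (hydrophilic)
    return "N"      # neutral (uncertain)


print([classify_environment(s) for s in (5.0, 40.0, 120.0)])  # ['H', 'N', 'P']
```

A relative SASA (observed area divided by the maximum area for that residue type) could be thresholded in the same way; which form the study used is not stated in this text.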

Contact energy function

Common energy function

The common contact energy function is based on the contact preference of residues. The contact energy between residues $a_i$ and $a_j$ is defined as

$$e(a_i, a_j; r) = \ln \frac{p(a_i, a_j; r)}{p_0(a_i, a_j; r)} \tag{1}$$

where $p(a_i, a_j; r)$ is the probability that the distance between $a_i$ and $a_j$ is less than the designated cutoff value $r$, and $p_0(a_i, a_j; r)$ is the corresponding expected probability. During fold recognition, the residues of the target protein are first placed onto templates by an alignment method, and the score of the alignment is then determined by

$$E = \sum_{(i, j)} e(a_i, a_j; r)\, \sigma(r_{ij} - r) \tag{2}$$

where $(i, j)$ is a site pair of the template; $a_i$ and $a_j$ are the residue types on sites $i$ and $j$; $\sigma(x)$ is equal to 0 when $x > 0$ and equal to 1 when $x \le 0$; and $r_{ij}$ is the distance between sites $i$ and $j$. Energy functions based on this idea are still broadly used in fold recognition methods.
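
A minimal sketch of how such a contact energy might be estimated from observed contacts and then used to score an alignment is given below. Several choices are assumptions of this sketch and are not taken from the study: the pair energy is taken as the logarithm of the observed-to-expected contact probability ratio (so frequent contacts score positively, matching the Results), the expected probability comes from a simple composition-based reference state, and a pseudocount avoids taking the logarithm of zero. All function names are hypothetical.

```python
import math
from collections import Counter


def contact_energies(contact_pairs, pseudocount=1.0):
    """Estimate log(observed / expected) for each residue-type pair.

    `contact_pairs` is an iterable of (a, b) residue types observed in contact.
    The expected probability uses a composition-based reference state (assumption).
    """
    pair_counts, residue_counts = Counter(), Counter()
    for a, b in contact_pairs:
        pair_counts[tuple(sorted((a, b)))] += 1  # contacts are unordered
        residue_counts[a] += 1
        residue_counts[b] += 1
    types = sorted(residue_counts)
    n_pair_types = len(types) * (len(types) + 1) // 2
    total_pairs = sum(pair_counts.values()) + pseudocount * n_pair_types
    total_residues = sum(residue_counts.values())
    energies = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            observed = (pair_counts[(a, b)] + pseudocount) / total_pairs
            f_a = residue_counts[a] / total_residues
            f_b = residue_counts[b] / total_residues
            expected = f_a * f_b if a == b else 2 * f_a * f_b
            energies[(a, b)] = energies[(b, a)] = math.log(observed / expected)
    return energies


def alignment_score(residues, distances, energies, cutoff):
    """Sum pair energies over site pairs (i, j) whose distance is within the cutoff."""
    score = 0.0
    for (i, j), r_ij in distances.items():
        if r_ij - cutoff <= 0:  # step function: 1 when r_ij - r <= 0, else 0
            score += energies[(residues[i], residues[j])]
    return score


pairs = [("L", "L")] * 6 + [("L", "K")] * 2 + [("K", "K")] * 2  # toy contact list
e = contact_energies(pairs)
print(alignment_score("LLK", {(0, 1): 5.0, (0, 2): 9.0, (1, 2): 7.5}, e, cutoff=7.5))
```

In the toy data leucine-leucine contacts occur more often than composition alone predicts, so that pair receives a positive value, while the leucine-lysine pair is under-represented and receives a negative one.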

Improved energy function

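In this study, an improved energy function is proposed, which considers not only the residue type, but also the residue hydrophobic environment. This energy function is given by

$$e(a_i, a_j; \mathrm{env}_i, \mathrm{env}_j) = \ln \frac{p(a_i, a_j; \mathrm{env}_i, \mathrm{env}_j)}{p_0(a_i, a_j; \mathrm{env}_i, \mathrm{env}_j)} \tag{3}$$

where $p(a_i, a_j; \mathrm{env}_i, \mathrm{env}_j)$ is the probability that $a_i$ and $a_j$ are in contact with each other in the hydrophobic environments $\mathrm{env}_i$ and $\mathrm{env}_j$, respectively, and $p_0(a_i, a_j; \mathrm{env}_i, \mathrm{env}_j)$ is the corresponding expected probability. The distance between residues is measured by the distance between Cβ atoms (Cα atoms for glycine), and the cutoff $r$ is set to 7.5 Å. Similar to Equation (2), the energy score of the sequence-template alignment is defined as

$$E = \sum_{(i, j)} e(a_i, a_j; T_i, T_j)\, \sigma(r_{ij} - r) \tag{4}$$

where $(T_i, T_j)$ denotes the hydrophobic environments of sites $i$ and $j$ in the template, and $r_{ij}$ is the distance between sites $i$ and $j$.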
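
The environment-dependent score can be sketched as a lookup keyed by residue types and site environments, summed over site pairs within the 7.5 Å cutoff. The coordinates are assumed to be one representative atom per site (Cβ, or Cα for glycine), and the energy table passed in is a toy stand-in. The sketch counts every site pair within the cutoff, including sequence neighbors; whether the study excluded such pairs is not stated in this text. Function names are hypothetical.

```python
import math

CUTOFF = 7.5  # Å, the contact cutoff given in the text


def contacts(coords, cutoff=CUTOFF):
    """Site pairs (i, j), i < j, whose representative atoms lie within the cutoff."""
    return [
        (i, j)
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
        if math.dist(coords[i], coords[j]) <= cutoff
    ]


def environment_score(residues, coords, envs, energy, cutoff=CUTOFF):
    """Sum energy[(a_i, a_j, T_i, T_j)] over all contacting template site pairs."""
    return sum(
        energy[(residues[i], residues[j], envs[i], envs[j])]
        for i, j in contacts(coords, cutoff)
    )


coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]  # toy template sites
energy = {("L", "V", "H", "H"): 0.8}                            # toy table entry
print(environment_score("LVK", coords, "HHP", energy))  # 0.8
```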

Protein fold recognition based on threading

Threading is an efficient method for protein fold recognition that can be used to evaluate the structural similarity among proteins with low sequence identity. The threading process implemented in this study is shown in Figure 2 and consists of the following steps:

Fig. 2

The flow chart of the threading process in this study.

1. Build the fold template library. Based on the classification of the CATH database, 438 fold families and their representative structures were selected, resulting in a fold template library containing 438 structures.
2. Determine the residue contact energy. The residue contact energy scores were used to calculate the alignment score between target sequences and structure templates, as determined by Equations (2) and (4).
3. Align target sequences to structure templates. Because the sequence lengths of the target protein and the template proteins usually differ and gaps are allowed in the sequence-structure alignment, searching for the optimal alignment is an NP-hard problem. To address this, the divide-and-conquer algorithm, an approximate global optimum search algorithm, was adopted in this study.
4. Determine the optimal template. The best-fitting template for each target sequence was determined by z-score.
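
To make the alignment step concrete, the sketch below uses gapless threading, a deliberately simpler technique than the gapped divide-and-conquer search adopted in the study: the target sequence is slid along the template without gaps and the best-scoring offset is kept. It assumes that a higher score means a better fit, and `pair_energy` is a hypothetical callable standing in for either contact energy function.

```python
def gapless_threading(sequence, template_contacts, template_length, pair_energy):
    """Slide `sequence` along the template without gaps.

    Returns (best_score, best_offset), or None if the sequence is longer
    than the template. This is a simplified stand-in, not the study's algorithm.
    """
    best = None
    for offset in range(template_length - len(sequence) + 1):
        score = 0.0
        for i, j in template_contacts:
            si, sj = i - offset, j - offset  # sequence indices placed on sites i and j
            if 0 <= si < len(sequence) and 0 <= sj < len(sequence):
                score += pair_energy(sequence[si], sequence[sj])
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Toy example: a four-site template with one contact between sites 2 and 3.
result = gapless_threading("LL", [(2, 3)], 4, lambda a, b: 1.0)
print(result)  # (1.0, 2)
```

Running this for every template in the library and converting the best scores to z-scores would give a gapless analogue of steps 3 and 4.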

33 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Protein structure prediction and structural genomics.

Authors: D Baker; A Sali
Journal: Science Date: 2001-10-05 Impact factor: 47.728

Review 3. Protein folding and misfolding.

Authors: Christopher M Dobson
Journal: Nature Date: 2003-12-18 Impact factor: 49.962

4. Studies on the gross structure, cross-linkages, and terminal sequences in ribonuclease.

Authors: C B ANFINSEN; R R REDFIELD; W L CHOATE; J PAGE; W R CARROLL
Journal: J Biol Chem Date: 1954-03 Impact factor: 5.157

Review 5. A structural perspective on protein-protein interactions.

Authors: Robert B Russell; Frank Alber; Patrick Aloy; Fred P Davis; Dmitry Korkin; Matthieu Pichaud; Maya Topf; Andrej Sali
Journal: Curr Opin Struct Biol Date: 2004-06 Impact factor: 6.809

6. Topology fingerprint approach to the inverse protein folding problem.

Authors: A Godzik; A Kolinski; J Skolnick
Journal: J Mol Biol Date: 1992-09-05 Impact factor: 5.469

7. Can contact potentials reliably predict stability of proteins?

Authors: Jainab Khatun; Sagar D Khare; Nikolay V Dokholyan
Journal: J Mol Biol Date: 2004-03-05 Impact factor: 5.469

Review 8. Development of novel statistical potentials for protein fold recognition.

Authors: N-V Buchete; J E Straub; D Thirumalai
Journal: Curr Opin Struct Biol Date: 2004-04 Impact factor: 6.809

9. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

Authors: J Park; K Karplus; C Barrett; R Hughey; D Haussler; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1998-12-11 Impact factor: 5.469

Review 10. Knowledge-based prediction of protein structures and the design of novel molecules.

Authors: T L Blundell; B L Sibanda; M J Sternberg; J M Thornton
Journal: Nature Date: 1987 Mar 26-Apr 1 Impact factor: 49.962
