Literature DB >> 16689690

A branch and bound algorithm for the protein folding problem in the HP lattice model.

Abstract

A branch and bound algorithm is proposed for the two-dimensional protein folding problem in the HP lattice model. In this algorithm, the benefit of each possible location of hydrophobic monomers is evaluated and only promising nodes are kept for further branching at each level. The proposed algorithm is compared with other well-known methods for 10 benchmark sequences with lengths ranging from 20 to 100 monomers. The results indicate that our method is a very efficient and promising tool for the protein folding problem.

Entities: Disease Species

Mesh：

Substances：
Amino Acids
Proteins

Year: 2005 PMID： 16689690 PMCID： PMC5172541 DOI： 10.1016/s1672-0229(05)03031-7

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN： 1672-0229 Impact factor: 7.691

Introduction

The protein folding problem, or the protein structure prediction problem, is one of the most interesting problems in biological science. Studies have indicated that proteins’ biological functions are determined by their dimensional folding structures. Because the structure of a protein is strongly correlated with the sequence of amino acid residues, predicting the native conformation of a protein from its given sequence is a feasible approach and is of great significance for the protein engineering. Since the problem is too difficult to be approached with fully realistic potentials, the theoretical community has introduced and examined several highly simplified models. One of them is the HP model of Dill et al. 1., 2., 3. where each amino acid is treated as a point particle on a regular (quadratic or cubic) lattice, and only two types of amino acids—hydrophobic (H) and polar (P)—are considered. Although the HP model is extremely simple, it still captures the essence of the important components of the protein folding problem (. The protein folding problem in the HP model has been shown to be NP-complete, and hence unlikely to be solvable in polynomial time 5., 6., 7.. For relatively short chains, an exact enumeration of all the conformations is possible. In dealing with longer chains, however, more efficient approximation algorithms are certainly desirable. The methods used to find low energy structures of the HP model include genetic algorithm (GA; ref. 8., 9., 10., 11., 12.), Monte Carlo (MC; ref. 10., 12.), simulated annealing (, etc. These algorithms can find optimal or near-optimal energy structures for most benchmark sequences, however, their computation time is rather long. In this paper, a branch and bound algorithm is proposed to find the native conformation for the two-dimensional (2D) HP model. The experimental results have shown that our algorithm is very efficient, which can find optimal or near-optimal conformations in a very short time for a number of sequences with lengths ranging from 20 to 100 monomers.

Model

Let us consider this problem in 2D Euclidean space. The monomers are numbered consecutively from 1 to n along the chain, which is folded on the square lattice, and each monomer occupies one site with the center on the lattice point. Note that each monomer should be connected to its chain neighbors and is unable to occupy a site filled by other monomers. If monomer i is placed on the square lattice, then the coordinates of its location are denoted by (x, y). The HP model is based on the assumption that the hydrophobic interaction is one of the fundamental principles in the protein folding. An attractive hydrophobic interaction provides for the main driving force for the formation of a hydrophobic core that is screened from the aqueous environment by a shell of polar monomers. Therefore, the energy function of the HP model is defined as: where σ = 1 if the ith monomer in the chain is hydrophobic, otherwise σ = 0. In other words, the energy of a conformation can be obtained by counting the number of adjacent pairs of hydrophobic monomers (H–H) that are not consecutively numbered, and multiplying by —1. The goal of the protein folding problem is to find the conformation with the minimal energy. Figure 1 shows a folding conformation of sequence HPPHPPHPHPPHP on the 2D square lattice. It can be seen that each monomer occupies one lattice site connected to its chain neighbors. The energy of this conformation is —4, which is the lowest energy state of the sequence. Obviously, there is a compact hydrophobic core in the folded conformation.

Fig. 1

The lowest energy conformation with E = —4 of sequence HPPHPPHPHPPHP. Black point particle: hydrophobic (H); White point particle: polar (P).

Algorithm

In our algorithm, a conformation is built by adding a new monomer at an allowed neighbor site of the last placed monomer on the square lattice. In order to obtain a self-avoiding conformation, an already occupied neighbor should not be considered. The monomers are placed consecutively until all the n (the length of the chain) monomers are placed, that is, our algorithm is a growth algorithm. If k − 1 (1 ≤ k ≤ n) monomers have been placed on the square lattice, the kth monomer may have three possible locations: turn 90° right, turn 90° left, or continue ahead. Figure 2 gives a partial conformation where four monomers have been placed on the square lattice. It can be seen that there are three unoccupied positions neighboring to Monomer 4. The next monomer, namely Monomer 5, can be placed at any one of these unoccupied positions, resulting in three different partial conformations accordingly. In this way, all possible folding conformations of a sequence can be enumerated. As shown in Figure 3, a search tree representation can be used to denote all possible folding conformations, with three descendants at most for each node. Each node in the search tree corresponds to a partial conformation, and a line between two nodes represents a placement choice of a new monomer to the existing partial conformation. Consequently, leaf nodes at the end of the tree correspond to the complete conformation.

Fig. 2

The three possible positions for Monomer 5.

Fig. 3

A representation of the search tree.

From Figure 3, it is obvious that the conformational space grows exponentially when the length of the protein chain increases. As mentioned by Unger and Moult (, the number of possible (self-avoiding) conformations for an L-long sequence on a 2D square lattice is AμL, where μ ≈ 2.63 and γ ≈ 0.333. Accordingly, for a protein chain of not too short length, the search space is too huge to find the lowest energy conformation within a reasonable running time. To reduce the computational cost, a so-called branch and bound method is introduced in this paper. In this search method, only the promising nodes are kept for further branching and the remaining nodes are pruned off permanently. Since a large part of the search tree is pruned off aggressively to obtain a solution, its running time is polynomial in the size of the problems. In our algorithm, we treat H monomers and P monomers differently. For a partial conformation where k−1 monomers have been placed on the square lattice, if the kth monomer is P, then all possible branches should be kept. Otherwise, if the kth monomer is H, then the benefit of all possible branches of the kth monomer will be evaluated and some branches may be pruned. That is to say, the main part of our algorithm is centered on the evaluation and pruning of the H monomers. This strategy maintains the diversity of the conformations and eliminates the hopeless partial conformation at the same time. The details are as follows: We set two variables, U and Z, as the thresholds to evaluate the benefit of all branches for monomer k. Here, U is defined as the lowest energy of the partial conformation with length k that has ever been generated so far, and Z is the arithmetic average energy of the partial conformation with length k so far. After pseudo-placing monomer k at a possible location, we calculate E, which is defined as the energy of the current partial conformation with k monomers placed. It should be pointed out that the term “pseudo-place” means that it is just a test and the placing process can be reverted. Then we compare E with thresholds U and Z: If E ≤ U, it means that this partial conformation is very promising and this branch should be kept. If E > Z, that means the benefit of the partial conformation is below the average, so this conformation is discarded with probability ρ1 Otherwise, if Z ≥ E > U, the partial conformation is discarded with probability ρ2. The pseudo-code of this subroutine is presented in Figure 4, including the details of evaluation criterion and the pruning mechanism, which is the main part of our algorithm.

Fig. 4

The pseudo-code of the subroutine in the branch and bound algorithm.

The above process is implemented in a recursive way until all the conformations are either pruned or hit length n. From the conformations hitting length n, we choose one with the lowest energy as the output of the algorithm. It should be mentioned that the search could be implemented by depth-first or breadth-first, where the two results are identical. In this paper, our algorithm is implemented by depth-first. Here, E is the minimal energy of the complete conformations ever built. Note that the first two monomers of a chain can be placed on the square lattice randomly. Therefore, the input parameters are k = 3, E2 = 0. The initial values of the two thresholds U and Z are both 0. Obviously, if ρ1 = 0 and ρ2 = 0, the search space will be the complete tree (no node be pruned) and it will take a prohibitively long time to search for the lowest energy conformation. If ρ1 = 1 and ρ2 = 1, it takes a very little time to search the entire search space because the thresholds are so high that many promising nodes may be discarded. That is to say, the higher the value of the probabilities, the more difficult a branch is to be kept. Therefore, choosing the value of ρ1 and ρ2 is an essential factor affecting the speed and efficiency of this approach. In this paper, we let ρ1 = 0.8 and ρ2 = 0.5. The probability ρ2 is chosen to be less than ρ1 because a partial conformation with energy below average is more promising than a high energy partial conformation. In this way, E, the energy of the partial conformation, can be viewed as the energy expectation of the partial conformation after looking one step ahead and Z is expressed as the mean energy of the already generated partial conformations of length k. Z keeps a historical record, which is, to a large extent, conducive to the formulation of promising conformations. For any partial conformation, it would have more opportunities to procreate if holding higher individual quality (E), which is in accordance with the law of natural selection.

Validation

To test the performance of the branch and bound algorithm, we compared it with the MC, GA, and mixed search (MS; ref. ) algorithms by using 10 benchmark sequences for evaluation (Table 1).

Table 1

The 10 Benchmark Sequences for Algorithm Evaluation

Length	Sequence
20	HPHPPHHPHPPHPHHPPHPH
24	HHPPHPPHPPHPPHPPHPPHPPHH
25	PPHPPHHPPPPHHPPPPHHPPPPHH
36	PPPHHPPHHPPPPPHHHHHHHPPHHPPPPHHPPHPP
48	PPHPPHHPPHHPPPPPHHHHHHHHHHPPPPPPHHPPHHPPHPPHHHHH
50	PPHPPHPHPHHHHPHPPPHPPPHPPPPHPPPHPPPHPHHHHPHPHPHPHH
60	PPHHHPHHHHHHHHPPPHHHHHHHHHHPHPPPHHHHHHHHHHHHPPPPHHHHHHPHHPHP
64	HHHHHHHHHHHHPHPHPPHHPPHHPPHPPHHPPHHPPHPPHHPPHHPPHPHPHHHHHHHHHHHH
85	HHHHPPPPHHHHHHHHHHHHPPPPPPHHHHHHHHHHHHPPPHHHHHHHHHHHHPPPHHHHHHHHHHHHPPPHPPHHPPHHPPHPH
100	PPPHHPPHHHHPPHHHPHHPHHPHHHHPPPPPPPPHHHHHHPPHHHHHHPPPPPPPPPHPHHPHHHHHHHHHHHPPHHHPHHPHPPHPHHHPPPPPPHHH

Table 2 presents the results obtained by the four methods on the 10 different sequences. As shown in the table, our branch and bound algorithm can find the optimal lowest energy conformations for six sequences. It is noteworthy that our algorithm can find one native state for the sequence of length 60, whereas the other three methods failed. For the two long sequences of length 85 and 100, respectively, our algorithm can find near-optimal energy conformations. It should be pointed out that predicting the longest sequence of length 100 is a hard problem, whose native state can only be obtained by a few methods such as the PERM algorithm 14., 15. and the guided simulated annealing method (.

Table 2

Performance Comparison of the Four Algorithms *

Length	Optimal	MC	GA	MS	BB
20	−9	−9	−9	−9	−9
24	−9	−9	−9	−9	−9
25	−8	−7	−8	−8	−8
36	−14	−12	−14	−14	−14
48	−23	−18	−22	−22	−22
50	−21	−19	−21	−21	−21
60	−36	−31	−34	−34	−36
64	−42	−31	−37	−38	−38
85	−53	N/A	N/A	N/A	−52
100	−50	N/A	N/A	N/A	−48

Performance comparison on finding the lowest energy conformations of the four algorithms, including Monte Carlo (MC), genetic algorithm (GA), mixed search (MS), and branch and bound (BB).

We did not compare the speed with other methods directly because the machines were different. Moreover, the running time of the other three methods was presented in terms of “number of steps” while the exact CPU time was used in our test. All the computations in this study were carried on a 2.4 GHz PC with 512 M memory. The CPU time for all sequences was less than 10 s except the sequence of length 64, for which the CPU time was 39.46 s. It can be seen from Unger and Moult ( that the “number of steps” of MC and GA methods increases badly with the increase of sequence lengths, therefore, it is imaginable that the computational speed of MC and GA methods in Unger and Moult ( for practical applications is unacceptable. The resulting folding conformations for sequences with 24, 36, 60, 85, and 100 monomers are given in Figure 5, respectively. For sequences with 24, 36, and 60 monomers, the corresponding conformations are all of the lowest energy. For the other two sequences with longer lengths, the corresponding conformations are also of near-optimal energy. It can be seen that the conformation has a single compact hydrophobic core for all sequences, which is analogous to the real protein structure.

Fig. 5

The lowest energy states of the sequences with length n = 24, 36, 60, 85, and 100, respectively.

Conclusion

The branch and bound algorithm proposed in this paper is a novel and effective tool for the conformational search in the low-energy regions of the protein folding problem in the 2D HP model. The experimental results on 10 benchmark sequences demonstrate that our algorithm outperforms other three methods in terms of speed and efficiency. Our algorithm is similar to the “population control” scheme ( where individuals would have more opportunities to procreate if holding higher individual quality, and the pruning mechanism reduces considerably the computational burden of search. This is the root reason why our approach yields high efficiency. With slight modification, this algorithm can be extended for the 3D version. We should point out that, the coding of this algorithm is very simple and hence it can be easily implemented by practitioners.

11 in total

1. Improving genetic algorithms for protein folding simulations by systematic crossover.

Authors: R König; T Dandekar
Journal: Biosystems Date: 1999-04 Impact factor: 1.973

2. Guided simulated annealing method for optimization problems.

Authors: C I Chou; R S Han; S P Li; Ting-Kuo Lee
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2003-06-10

3. Theory for protein mutability and biogenesis.

Authors: K F Lau; K A Dill
Journal: Proc Natl Acad Sci U S A Date: 1990-01 Impact factor: 11.205

4. Cooperativity in protein-folding kinetics.

Authors: K A Dill; K M Fiebig; H S Chan
Journal: Proc Natl Acad Sci U S A Date: 1993-03-01 Impact factor: 11.205

5. On the complexity of protein folding.

Authors: P Crescenzi; D Goldman; C Papadimitriou; A Piccolboni; M Yannakakis
Journal: J Comput Biol Date: 1998 Impact factor: 1.479

6. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete.

Authors: B Berger; T Leighton
Journal: J Comput Biol Date: 1998 Impact factor: 1.479

7. Theory for the folding and stability of globular proteins.

Authors: K A Dill
Journal: Biochemistry Date: 1985-03-12 Impact factor: 3.162

Review 8. Principles of protein folding--a perspective from simple exact models.

Authors: K A Dill; S Bromberg; K Yue; K M Fiebig; D P Yee; P D Thomas; H S Chan
Journal: Protein Sci Date: 1995-04 Impact factor: 6.725

9. Reduced representation model of protein structure prediction: statistical potential and genetic algorithms.

Authors: S Sun
Journal: Protein Sci Date: 1993-05 Impact factor: 6.725

10. Genetic algorithms for protein folding simulations.

Authors: R Unger; J Moult
Journal: J Mol Biol Date: 1993-05-05 Impact factor: 5.469

3 in total

1. Study of peptide fingerprints of parasite proteins and drug-DNA interactions with Markov-Mean-Energy invariants of biopolymer molecular-dynamic lattice networks.

Authors: Lázaro Guillermo Pérez-Montoto; María Auxiliadora Dea-Ayuela; Francisco J Prado-Prado; Francisco Bolas-Fernández; Florencio M Ubeira; Humberto González-Díaz
Journal: Polymer (Guildf) Date: 2009-06-03 Impact factor: 4.430

2. On the characterization and software implementation of general protein lattice models.

Authors: Alessio Bechini
Journal: PLoS One Date: 2013-03-29 Impact factor: 3.240

3. Scoring function for DNA-drug docking of anticancer and antiparasitic compounds based on spectral moments of 2D lattice graphs for molecular dynamics trajectories.

Authors: Lázaro G Pérez-Montoto; Lourdes Santana; Humberto González-Díaz
Journal: Eur J Med Chem Date: 2009-06-17 Impact factor: 6.514

3 in total