Algorithm to find distant repeats in a single protein sequence.

Nirjhar Banerjee, Rangarajan Sarani, Chellamuthu Vasuki Ranjani, Govindaraj Sowmiya, Daliah Michael, Narayanasamy Balakrishnan, Kanagaraj Sekar.

Abstract

Distant repeats in protein sequences play an important role in various aspects of protein analysis. A careful analysis of distant repeats makes it possible to relate the repeats to their function and three-dimensional structure over the course of evolution. It also sheds light on the diversity of duplication events during evolution. To this end, an algorithm has been developed to find all distant repeats in a protein sequence. Scores from the Point Accepted Mutation (PAM) matrix are used to identify amino acid substitutions while detecting the distant repeats. Given the biological importance of distant repeats, the proposed algorithm should be useful to structural biologists, molecular biologists, biochemists and researchers involved in phylogenetic and evolutionary studies.

Keywords: distant repeats; genome sequences; phylogeny; point accepted mutation; structure-function relationship

Year: 2008 PMID: 19052663 PMCID: PMC2586129 DOI: 10.6026/97320630003028

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Distant repeats arise through the duplication and recombination of genes, which gives rise to amino acid repeats within the protein sequence [1]. Andrade and co-workers examined the important differences between certain protein families in order to study their evolution, structure, and function [2]. To understand the structural and functional relationships of repeats in protein sequences, many types of repeats, such as Ankyrin, Armadillo, HEAT, TPR, HAT, Kelch and leucine-rich repeats, have been studied in detail [3]. To better understand these repeats, several algorithms have been developed to find amino acid repeats in protein sequences [4-9]. McLachlan was one of the pioneers of automatic repeat recognition by computer, three decades ago [10]. McLachlan and Stewart used Fourier transform analysis (autocorrelation techniques) [4,11], and other groups have since developed different techniques and algorithms to detect patterns of distant amino acid repeats in protein sequences. Heringa and Argos introduced a method for the determination of distant repeats that looks for internal similarities by comparing the protein sequence to itself with standard sequence–sequence alignment techniques [5]. Morgenstern and co-workers introduced an algorithm in which a multiple sequence alignment based on segment-to-segment comparison is used to find local similarities between amino acids in protein sequences [6,12,13]. Further, Andrade and co-workers derived a homology-based algorithm that identifies protein repeats using statistical significance [3]. Other servers, such as REPPER [7], REPRO [8] and RADAR [9], use different algorithms to find repeats in protein sequences.
However, most of the above-mentioned algorithms restrict the number of residues in the query sequence and do not give easily interpretable results. To overcome this, an algorithm has been derived from the recent algorithm FAIR [14] for finding the distant repeats in a protein sequence. The proposed algorithm uses the PAM (Point Accepted Mutation) matrix to identify distant repeats. According to Dayhoff and co-workers, residue pairs with scores above one replace each other more often in related sequences than in random sequences during evolution, which indicates that the two residues may carry out similar functions. A score exactly equal to one indicates amino acid pairs that are found as alternatives at exactly the frequency predicted by chance. Residue pairs with scores below one replace each other less often in related sequences than in random sequences, which suggests that these residues are functionally disparate. The PAM250 matrix is chosen by default because at this evolutionary distance (250 substitutions per hundred residues) only one amino acid in five remains unchanged [15]. Thus, the proposed algorithm overcomes the constraints that have limited previous algorithms.

Methodology

The algorithm finds all possible distant amino acid sequence repeats in a given protein sequence. Two amino acid strings are considered repeats if their corresponding residues are either identical or have a positive PAM matrix score (greater than or equal to 1). The strings ‘KLN’ and ‘QLD’ are therefore distant repeats under the PAM250 matrix (K with Q has a score of 1, and N with D has a score of 2). All available PAM matrices are incorporated; they were downloaded from the EMBL website (http://eta.embl-heidelberg.de:8000/misc/mat/). The user can choose a specific PAM matrix for finding the distant repeats in a particular protein sequence. The algorithm proposed here is derived from the recent algorithm FAIR [14], and the details are described below.
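The matching rule can be illustrated with a minimal Python sketch. It is not the authors' code: the names `PAM250_PARTIAL`, `residues_match` and `is_distant_repeat` are hypothetical, and the score table holds only the two PAM250 values quoted in the text, so any other non-identical pair is treated as a non-match here.

```python
# Partial PAM250 scores: only the two pairs quoted in the text.
# A real run would load the full 20x20 matrix chosen by the user.
PAM250_PARTIAL = {
    frozenset("KQ"): 1,
    frozenset("ND"): 2,
}

def residues_match(x, y, scores=PAM250_PARTIAL, threshold=1):
    """True if x and y are identical or their score is >= threshold."""
    if x == y:
        return True
    score = scores.get(frozenset((x, y)))
    # Assumption: pairs missing from this partial table do not match.
    return score is not None and score >= threshold

def is_distant_repeat(s1, s2, scores=PAM250_PARTIAL):
    """True if two equal-length strings match at every position."""
    return len(s1) == len(s2) and all(
        residues_match(x, y, scores) for x, y in zip(s1, s2)
    )

print(is_distant_repeat("KLN", "QLD"))  # True
```

Keying the table on an unordered pair reflects the symmetry of the PAM matrix, so K with Q and Q with K share one entry.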

Finding matches based on PAM matrix

Initially, the protein sequence is stored in a string a1. The algorithm follows the same approach to finding repeats as FAIR, except that instead of looking for exact matches it looks for matches based on the PAM matrix scores. The algorithm takes the choice of matrix from the user. Then, for each pair of elements (a1[i], a2[j]), it checks the corresponding PAM matrix to see whether the score is greater than or equal to one. ‘pamvalue’ is a Boolean flag that is true for a match and false when no such match exists. Thus, if the user chooses the PAM250 matrix, the corresponding code is as shown in illustration 1 in the supplementary material.
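The supplementary code is not reproduced here. As a rough illustration of comparing a sequence with itself under such a match test, the Python sketch below scans each diagonal of the self-comparison and reports maximal runs of matching residues. The diagonal scan is an assumption about how this kind of search can be organised, not a description of FAIR's internals, and `find_repeats`, `pamvalue` (used here as a function name) and the partial score table are hypothetical.

```python
# Only the two PAM250 scores quoted in the text; other pairs are unknown here.
PAM250_PARTIAL = {frozenset("KQ"): 1, frozenset("ND"): 2}

def pamvalue(x, y):
    """Boolean match test: identical residues, or a known score >= 1."""
    return x == y or PAM250_PARTIAL.get(frozenset((x, y)), 0) >= 1

def find_repeats(a1, match, min_len=3):
    """Return (start1, end1, start2, end2) tuples, 0-based and inclusive."""
    n = len(a1)
    hits = []
    for offset in range(1, n):            # second copy starts `offset` later
        run = 0
        for i in range(n - offset + 1):   # one extra step flushes the last run
            j = i + offset
            if i < n - offset and match(a1[i], a1[j]):
                run += 1
                continue
            if run >= min_len:
                hits.append((i - run, i - 1, j - run, j - 1))
            run = 0
    return hits

print(find_repeats("KLNAAQLD", pamvalue))  # [(0, 2, 5, 7)]
```

This sketch reports only the longest run on each diagonal and does not filter out runs whose two copies overlap; whether the published algorithm does either is not stated in the text.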

Storing subsequences and repeat positions

After completion of the first part of the algorithm, the ‘end-points’ and the lengths of the repeats have been stored. The next part is explained with the help of Figure 1, where the same strings ‘KLN’ and ‘QLD’ are taken as an example. As shown in Figure 1, the array ‘startd’ contains the starting positions of the ‘first sequence’ and the ‘second sequence’. Similarly, the array ‘endd’ contains the end positions of the two sequences. The manner in which the algorithm stores the repeat sequence and the start and end points in the vector ‘vsubseq’ is identical to that of FAIR [14]. The algorithm then sorts the repeats and removes identical entries to produce a non-redundant output of distant repeats.
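The bookkeeping described above can be sketched in Python as follows. The names `startd`, `endd` and `vsubseq` come from the text, but the record layout, the helper functions `record` and `non_redundant`, and the toy sequence are assumptions made for illustration; the actual layout used by FAIR may differ.

```python
from typing import NamedTuple, Tuple

class Repeat(NamedTuple):
    first: str                 # residues of the first copy
    second: str                # residues of the second copy
    startd: Tuple[int, int]    # starts of first and second copy (1-based)
    endd: Tuple[int, int]      # ends of first and second copy (1-based)

def record(vsubseq, seq, start1, start2, length):
    """Append one repeat pair, given 1-based starts and a shared length."""
    end1, end2 = start1 + length - 1, start2 + length - 1
    vsubseq.append(Repeat(seq[start1 - 1:end1], seq[start2 - 1:end2],
                          (start1, start2), (end1, end2)))

def non_redundant(vsubseq):
    """Sort the entries and drop exact duplicates."""
    return sorted(set(vsubseq))

seq = "KLNAAQLD"               # toy sequence containing KLN and QLD
vsubseq = []
record(vsubseq, seq, 1, 6, 3)
record(vsubseq, seq, 1, 6, 3)  # the same hit reported twice
for r in non_redundant(vsubseq):
    print(r.first, r.second, r.startd, r.endd)  # KLN QLD (1, 6) (3, 8)
```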

Figure 1

Alignment of subsequences (KLN and QLD) to detect the distant repeats in a protein sequence.

Discussion

Case study 1

The sample output shown in Figure 2 is for an input protein sequence taken from Homo sapiens. The input sequence contains 2413 amino acid residues. The minimum number of amino acids required in a distant repeat was set to 100, and the calculation was performed using the scores of the PAM250 matrix. With this minimum, there are seven significant distant repeats, ranging from 108 to 257 amino acid residues. When the minimum number of amino acid residues in a repeat was instead set to 50, a significant set of repeats was identified: the algorithm produces four distant repeats of length 64 residues. This implies that these repeat domains in Homo sapiens (Figure 2) may have originated through duplication events and may be involved in an important biological function.

Figure 2

An illustration of case study 1 is shown. (a) input sequence to the algorithm; (b) and (c) output results.

Case study 2

The sample output shown in Figure 3 is for an input protein sequence taken from Streptococcus pneumoniae TIGR4. The input sequence contains 857 amino acid residues. The minimum number of amino acids required in a distant repeat was set to 100, and the calculation was performed using the scores of the PAM250 matrix. With this minimum, two significant distant repeats were found, one of 156 and one of 308 amino acid residues. As can be seen, the distant repeat containing 156 amino acid residues is a subset of the distant repeat containing 308 amino acid residues.
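The subset relation between two reported repeats can be checked with a small Python helper. This is only a sketch: `is_contained` is a hypothetical name, and the coordinates below are made up so that the lengths come out as 308 and 156; they are not the positions reported in Figure 3.

```python
def is_contained(inner, outer):
    """True if both copies of `inner` lie within the matching copies of `outer`.

    Each repeat is (start1, end1, start2, end2), 1-based and inclusive.
    """
    i_s1, i_e1, i_s2, i_e2 = inner
    o_s1, o_e1, o_s2, o_e2 = outer
    return (o_s1 <= i_s1 and i_e1 <= o_e1
            and o_s2 <= i_s2 and i_e2 <= o_e2)

# Made-up coordinates, chosen only so that the lengths are 308 and 156.
long_repeat = (1, 308, 401, 708)
short_repeat = (101, 256, 501, 656)
print(is_contained(short_repeat, long_repeat))  # True
```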

Figure 3

An illustration of case study 2 is shown with the input sequence to the algorithm and the output results.

Conclusion

An algorithm has been proposed to identify all the distant repeats present in a given protein sequence. PAM matrix scores are used to identify the distant repeats. Identifying such repeats in a protein sequence would help researchers to study how distant repeats correlate with structure and function in the evolutionary process. Thus, distant repeats can be exploited to study individual proteins through their evolutionarily conserved repeats and to model the three-dimensional structure of unknown proteins through their similar folding topology.

1. A census of protein repeats.

2. Evolution of bHLH transcription factors: modular evolution by domain shuffling?

3. Protein repeats: structures, functions, and evolution.

4. Rapid automatic detection and alignment of repeats in protein sequences.

5. DIALIGN: finding local similarities by multiple sequence alignment.

6. Multiple DNA and protein sequence alignment based on segment-to-segment comparison.

7. Analysis of periodic patterns in amino acid sequences: collagen.

8. A method to recognize distant repeats in protein sequences.

9. Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551.

10. REPPER--repeats and their periodicities in fibrous proteins.

1. A method to find palindromes in nucleic acid sequences.