BACKGROUND: Computational identification of phylogenetic motifs helps to characterize known functional features such as catalytic sites, substrate-binding epitopes, and protein-protein interfaces. These motifs are strongly conserved among orthologs, indicating their evolutionary importance. This study aimed to analyze five candidate genes involved in type II diabetic nephropathy and to predict phylogenetic motifs from their corresponding orthologous protein sequences. METHODS: AKR1B1, APOE, ENPP1, ELMO1 and IGFBP1 have been identified through experimental studies as important targets in type II diabetic nephropathy. Their protein sequences, structures, and orthologous sequences were retrieved from UniProtKB, PDB, and the PHOG database, respectively. Multiple sequence alignments were constructed using ClustalW, and phylogenetic motifs (PMs) were identified using MINER. The occurrence of amino acids in the identified phylogenetic motifs was visualized using WebLogo, and false positive expectations were calculated and compared against phylogenetic similarity. RESULTS: In total, 17 phylogenetic motifs were identified in the five proteins. Residues such as glycine, leucine, tryptophan, and aspartic acid were found at appreciable frequency, whereas arginine was identified in all the predicted PMs. The results imply that these residues may be important to the functional and structural roles of the proteins, and the calculated false positive expectations imply that the motifs are generally conserved in the traditional sense. CONCLUSION: The prediction of phylogenetic motifs is an accurate method for detecting functionally important conserved residues. The conserved motifs can be used as potential drug targets for type II diabetic nephropathy.
Diabetes mellitus is characterized by metabolic disorders of carbohydrate, lipid, and protein. “Type II diabetes mellitus is one of the primary threats to human health due to increasing prevalence, chronic course and disabling complications” (1, 2). Diabetic nephropathy (DN) is a major microvascular complication that affects 30–40% of all diabetic patients and represents a major cause of morbidity and mortality, owing to a serious, gradual decline in renal function (3). Several genes, proteins, and environmental factors are likely to contribute to the onset of DN (4). Several candidate genes have been identified as associated with DN in case-control studies. They were selected for their positional and/or functional characteristics and for the contribution of the corresponding proteins to the pathophysiological axes (5).
The AKR1B1 gene is expressed in human kidneys. Its product catalyzes the reduction of glucose to sorbitol. Under hyperglycaemic conditions this pathway is activated by the excess glucose, whereas under normal conditions it is relatively inactive. Accumulation of high levels of sorbitol disrupts osmoregulation in kidney cells, which leads to kidney damage (6). When overexpressed, ELMO1 promotes excess expression of transforming growth factor-β, collagen type 1, fibronectin and integrin-linked kinase and inhibits cell adhesion. ELMO1 is expressed in the presence of high glucose and has a potential role in the pathogenesis of diabetic nephropathy (7). Insulin-like growth factor binding proteins play a major role in cell growth and metabolism. IGFBP1 influences cell adhesion and migration and interacts with α5β1 integrin. Overexpression of IGFBP1 is associated with many glomerular diseases, including diabetic nephropathy (8). Ectonucleotide pyrophosphatase/phosphodiesterase 1 (ENPP1) is a candidate susceptibility gene for type 2 diabetes and obesity. It catalyzes the release of nucleoside 5′-monophosphates from nucleotides and their derivatives. ENPP1 is expressed in several tissues, including skeletal tissue, adipose tissue, liver and kidney. This gene is a risk factor for the development of diabetic nephropathy in type 2 diabetic patients (9). The apolipoprotein E gene is associated with susceptibility to diabetic nephropathy in both type 1 and type 2 diabetes. ApoE is a polymorphic protein with three isoforms defined by single amino acid substitutions at two sites. These substitutions alter the affinity of ApoE for its receptors, thereby influencing lipid metabolism. Several studies have reported that ApoE isoforms are associated with diabetic nephropathy (10).
Computational methods that predict the function of a protein from its amino acid sequence play a major role in guiding the experimental characterization of a genome (11). Although experimental methods exist to identify the sequences bound by a specific protein, they have not been widely applied, and computational approaches to ‘motif discovery’ have proven to be a useful alternative (12). Sequence-based phylogenetic motifs represent a promising functional site prediction strategy. Phylogenetic motifs (PMs) are short sequence alignment fragments that consistently correspond to functional sites defined by surface loops, active site clefts and less exposed regions (13). Structural clusters of conserved positions and trace residues, which are alignment positions conserved within phylogenetic clusters, are also used to identify functional regions (14).
The aim of the present study was to identify the functional regions of the proteins by building multiple sequence alignments between each target sequence and its orthologous sequences in order to find preferentially conserved residues. Structural verification was also performed to check the accuracy of the functional site predictions.
Materials and Methods
Datasets
The dataset consists of five candidate genes expressed in type II diabetic nephropathy, obtained from the literature (15–19): aldose reductase, apolipoprotein E, engulfment and cell motility protein 1, ectonucleotide pyrophosphatase/phosphodiesterase family member 1, and insulin-like growth factor binding protein 1. The protein sequence corresponding to each gene was retrieved from the UniProtKB database, and the structures were downloaded from PDB.
Phylogenetic analysis
The dataset of orthologous protein sequences, that is, sequences sharing a common ancestor, was obtained from the PHOG database, available at http://bioinf.fbb.msu.ru/phogs/index.html. The PHOG database is used in areas such as comparative genomics, proteomics, and evolutionary studies (20). The orthologous proteins obtained were grouped together, and their sequences were aligned with the program ClustalW to compare equivalent residues (21). A global dynamic programming algorithm was used to construct alignments over the full length of the sequences. These multiple sequence alignments provide structural and functional information.
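The global dynamic programming step mentioned above can be illustrated with a minimal Needleman-Wunsch sketch for a single pair of sequences. This is not ClustalW's implementation: ClustalW builds a multiple alignment progressively and uses substitution matrices with affine gap penalties, whereas the function name `global_alignment_score` and the simple match, mismatch and linear gap values below are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global (end-to-end) alignment score of two sequences.

    Scoring is deliberately simple (illustrative values, not ClustalW defaults):
    a fixed reward for identical residues, a fixed penalty for substitutions,
    and a linear penalty per gap position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell holds the best score for aligning a[:i] with b[:j].
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]
```

Because every residue of both sequences must be accounted for, the score reflects similarity over the full length of the sequences rather than the best local stretch.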
Phylogenetic motifs identification using MINER
The program MINER, available at http://www.pmap.csupomona.edu/MINER/, was used for PM identification. It uses a sliding sequence window algorithm to comprehensively evaluate the phylogenetic similarity between each window and the complete alignment (22). The multiple sequence alignment file generated by ClustalW and the three-dimensional structures of the master proteins were used as input for MINER. A sequence window width of 5 was used for all datasets. Each protein family requires its own threshold value to identify the functional regions correctly. Phylogenetic similarity z-score cutoffs with magnitudes between 1.5 and 2.0 have been used for accurate functional site prediction (23); therefore, the threshold was set to −1.7 for all datasets. The partitioning around medoids (PAM) clustering algorithm was used to evaluate the optimal range of thresholds. Phylogenetic similarity was calculated using the partition metric algorithm, and the resulting values were expressed as phylogenetic similarity z-scores (PSZs).
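The scoring and thresholding logic described above can be sketched as follows. This is a simplified illustration, not MINER's code: tree construction for each window is omitted, trees are assumed to be supplied as sets of bipartitions, and the function names, the use of the population standard deviation, and the convention that a lower raw distance means a more similar tree are all assumptions made for this example.

```python
from statistics import mean, pstdev


def partition_metric(splits_a, splits_b):
    """Count the bipartitions found in exactly one of two trees.

    Each tree is assumed to be given as a collection of hashable splits
    (for example frozensets of taxon names). A value of 0 means the two
    topologies share every split.
    """
    return len(set(splits_a) ^ set(splits_b))


def z_scores(values):
    """Standardize raw scores to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def windows_below_threshold(raw_scores, threshold=-1.7):
    """Return indices of windows whose z-score is at or below the threshold.

    raw_scores holds one partition-metric distance per sliding window,
    each measured against the tree built from the complete alignment.
    """
    return [i for i, z in enumerate(z_scores(raw_scores)) if z <= threshold]
```

For example, with nine windows at distance 10 and one window at distance 0, only the last window falls below −1.7 and would be reported as a candidate motif position. Overlapping windows that pass the threshold would then be merged into a single PM.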
Results
In this study, a phylogenetic approach was used to identify key functional residues. The five candidate genes, the PDB structures investigated for the corresponding proteins, and the number of PMs identified are listed in Table 1.
Table 1
Predicted phylogenetic motif parameters of the discussed protein data set
Gene | HUGO symbol | OMIM reference | UniProtKB ID | Seq (a) | PSZ (b) | PMs (c) | Structure investigated
Aldose reductase | AKR1B1 | 103880 | P15121 | 19 | −1.7 | 7 | 1ABN
Apolipoprotein E | APOE | 107741 | P02649 | 37 | −1.7 | 5 | 1B68
Engulfment and cell motility gene 1 | ELMO1 | 606420 | Q92556 | 7 | −1.7 | 1 | 2VSZ
Ectonucleotide pyrophosphatase/phosphodiesterase family member 1 | ENPP1 | 173335 | P22413 | 12 | −1.7 | 1 | 2YS0
Insulin-like growth factor binding protein 1 | IGFBP1 | 146730 | P08833 | 10 | −1.7 | 3 | 1ZT5
(a) Number of sequences in the alignment
(b) Phylogenetic similarity z-score threshold used in identification of the phylogenetic motifs
(c) Number of phylogenetic motifs identified
The Neighbor-Joining (NJ) method in ClustalW was used to construct the phylogenetic tree of the five proteins and their orthologs, shown in Fig. 1. The functional importance of the predicted regions was examined using the three-dimensional structures, which highlight the PM regions. The predicted conserved regions within the structures of the five proteins are shown in Fig. 2.
Fig. 1:
Unrooted phylogenetic tree of the five selected protein families and their orthologs. Colours within the tree correspond to the proteins discussed: aldose reductase (orange), ectonucleotide pyrophosphatase/phosphodiesterase family member 1 (light green), insulin-like growth factor-binding protein 1 (light pink), engulfment and cell motility protein 1 (blue) and apolipoprotein E (purple), each with its orthologs
Fig. 2:
Predicted PMs in the five proteins studied: (A) aldose reductase, (B) apolipoprotein E, (C) engulfment and cell motility protein 1, (D) ectonucleotide pyrophosphatase/phosphodiesterase family member 1, (E) insulin-like growth factor-binding protein 1. Each identified PM is highlighted in a different colour
Predicted highly conserved residues by WebLogo
The conserved residues were identified by examining the sequence logos of the predicted motifs, which are shown in Fig. 3. In total, 17 conserved regions were identified in the five proteins and their orthologs. Seven PMs were identified in aldose reductase, comprising the residues Lys21, Trp20, Thr19, Gly18, Leu17, Ala30, Glu29, Gln26, Val27, Thr28, Ile58, Gln59, Leu62, Lys61, Glu60, Gly90, Lys89, Val88, Leu87, Gly86, Asp230, Ile233, Arg232, Pro231, Val259, Ile260, Pro261, Lys262, Ser263, Val264, Thr265, Pro266, Leu301, Ser302, Cys303, Thr304, Ser305 and His306. Among these, lysine and threonine were the most highly conserved residues, whereas histidine, tryptophan, aspartic acid, and alanine were each identified in only one of the PMs.
Fig. 3:
Sequence logos (A–E) visually highlight the occurrence of conserved amino acid residues in the identified PMs
Three amino acid residues, glutamic acid, leucine and alanine, were observed as highly conserved among the five PMs identified in apolipoprotein E. Glycine, aspartic acid, arginine, valine, methionine, glutamine, and threonine were observed at appreciable frequency, whereas serine was observed in only one of the PMs. For insulin-like growth factor binding protein 1, glutamic acid occurred as a highly conserved residue in all three identified PMs. Proline, serine, valine, arginine and lysine were also found at appreciable frequency, whereas cysteine was found in only one of the PMs. Only one PM was identified for engulfment and cell motility protein 1, comprising the residues Arg570, Arg569, Arg568, Ala567 and Asn566. Among these residues, arginine was conserved, although asparagine and alanine were also observed. Similarly, one PM was predicted in ectonucleotide pyrophosphatase/phosphodiesterase family member 1, comprising the highly conserved residues Phe13, Arg14, Glu17, Gly16 and Cys15.
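The quantity that a sequence logo displays can be sketched with a short calculation of per-column information content. This is a minimal sketch assuming a 20-letter protein alphabet and ignoring gap characters; it omits the small-sample correction that WebLogo applies, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content of one alignment column in bits.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residues. Gap characters ('-') are ignored.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def logo_heights(sequences):
    """For each alignment position, map residue -> symbol height in bits.

    A symbol's height is its relative frequency multiplied by the column's
    information content, so the stack height equals the column information.
    """
    heights = []
    for column in zip(*sequences):
        info = column_information(column)
        residues = [c for c in column if c != "-"]
        total = len(residues)
        heights.append(
            {r: (n / total) * info for r, n in Counter(residues).items()}
        )
    return heights
```

A fully conserved column, such as one containing only arginine, reaches the maximum of log2(20), about 4.32 bits, while a column split evenly between two residues loses one bit of that maximum.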
Phylogenetic similarity z scores (PSZs) vs. False positive Expectations (FPEs)
To determine the probability of randomly encountering each motif, FPEs were calculated. The FPE approach was used to identify traditional motifs, which are regions of low sequence entropy. Conservation-based approaches are known to predict too many false positives to be satisfactory. The phylogenetic similarities versus FPEs for the five proteins are shown in Fig. 4. In the results, many FPEs occurred in ectonucleotide pyrophosphatase/phosphodiesterase family member 1, aldose reductase and apolipoprotein E. The conserved residues predicted in these proteins can therefore be considered traditional motifs.
Fig. 4:
Sequence correspondence of phylogenetic motifs (red) and traditional motifs (green) for (A) aldose reductase, (B) apolipoprotein E, (C) engulfment and cell motility protein 1, (D) ectonucleotide pyrophosphatase/phosphodiesterase family member 1, (E) insulin-like growth factor binding protein 1
Discussion
In the present study, computational analysis of the available sequences and crystal structures of five proteins expressed in type II diabetic nephropathy was used to identify functionally important residues. Comparing the structures of homologous proteins and analyzing large multiple sequence alignments can help to identify sequence conservation, structural conservation, and conserved interactions that are crucial for protein stability and function (24). Here, the three-dimensional structures of the proteins were used to establish the accuracy of the functional site predictions. Ortholog detection is essential for the functional annotation of genomes, with applications in predicting protein-protein interactions (25). Based on phylogenetic analysis, groups of evolutionarily related orthologous sequences were identified and aligned to find the phylogenetically conserved regions of the proteins.
In the present analysis, the occurrence of amino acid residues in the identified PMs was visualized with WebLogo. The sequence logo of each identified PM shows the highly conserved residues in the sequence. Each logo consists of stacks of letters, with one stack for each position in the sequence. The overall height of each stack indicates the sequence conservation at that position, whereas the height of each symbol within the stack reflects the relative frequency of the corresponding amino acid at that position (26). In the results, amino acids such as glycine, tryptophan and aspartic acid were found at appreciable frequency, whereas arginine was identified in all the predicted PMs. Not all conserved positions are related to function, but some amino acids tend to have structural roles when conserved (e.g. Trp, Leu, Gly, Cys), while others (mainly polar amino acids, or specific types, e.g. Asp, Ser, Cys, His) tend to be part of binding and active sites (27). Here, the conserved leucine residues in apolipoprotein E may be responsible for a structural role in the protein. The results therefore suggest that the predicted conserved positions may be important to the functional diversity of the proteins.
Traditional motifs were predicted from the identified PMs by calculating the FPE in the same sequence window that was used to calculate the PMs; the FPE describes the probability of encountering each sequence window. PMs were identified based on phylogenetic similarity, whereas FPEs were calculated based on sequence conservation. Motif-based approaches result in too many false positives to be useful in large-scale analyses (28). The predicted maps clearly indicate the presence of traditional motifs by showing many FPEs.
The focus of the present study was to map conserved positions onto a representative structure and the orthologous sequences of five candidate genes expressed in type II diabetic nephropathy. The comparison of PSZs with FPEs shows that the phylogenetic motifs can also be considered traditional motifs. Most of the identified conserved residues are expected to be critically related to the function of the protein. Further investigation of these functional sites could provide potential drug targets for type II diabetic nephropathy.
Ethical considerations
Ethical issues (including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, redundancy, etc.) have been completely observed by the authors.