A decision tree model for the prediction of homodimer folding mechanism.

Abishek Suresh, Velmurugan Karthikraja, Sajitha Lulu, Uma Kangueane, Pandjassarame Kangueane

Abstract

The formation of protein homodimer complexes for molecular catalysis and regulation is fascinating. The folding mechanism is known for 47 homodimer structures, which form through a 2S (2 state), 3SMI (3 state with monomer intermediate) or 3SDI (3 state with dimer intermediate) mechanism. Our dataset of forty-seven homodimers consists of twenty-eight 2S, twelve 3SMI and seven 3SDI. The dataset is characterized using monomer length, interface area and interface/total (I/T) residue ratio. We find that 2S homodimers are often small with a large I/T ratio, 3SDI homodimers are frequently large with a small I/T ratio, and 3SMI homodimers show a mixture of these features. Hence, we used these parameters to develop a decision tree model. The decision tree model produced positive predictive values (PPV) of 72% for 2S, 58% for 3SMI and 57% for 3SDI in cross validation. Thus, the method finds application in assigning folding mechanisms to homodimers.

Keywords: decision tree; folding; homodimer; mechanism; prediction

Year: 2009 PMID: 20461159 PMCID: PMC2859576 DOI: 10.6026/97320630004197

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Homodimers play an important role in catalysis and regulation. The formation of the homodimer interface is structurally intriguing [1], and the mechanism by which such interfaces form is of further interest. Structures for 47 homodimers with known folding information are now available, as given in Table 1 (supplementary material) [2-46]. These homodimers are formed through a 2-state (2S) mechanism [2-28], a 3-state mechanism with a monomer intermediate (3SMI) [36-46] or a 3-state mechanism with a dimer intermediate (3SDI) [29-35]. A couple of homodimers have been described as cancer targets [47,48,49], so homodimers are likely to be defined as drug targets in the future. Therefore, it is important to understand both homodimer association and the folding mechanism of homodimer formation. A number of attempts have been made to relate homodimer structures to folding mechanism in order to decipher folding-specific structural features [50-54]. We recently documented the relationship between structural features and homodimer folding mechanism [55]. Nevertheless, folding information is available for far fewer homodimers than the number of homodimer structures stored in databases [1]. Therefore, it is of interest to predict the folding mechanism for known homodimer structures. We created an improved dataset of 47 homodimer structures from PDB with known folding mechanism to glean parameters and to develop models that predict homodimer folding mechanism from structure. We then used these parameters to design a decision tree model to classify homodimer structures with unknown folding mechanism.

Methodology

Dataset

We created a dataset of 47 homodimer structures from PDB with known folding information taken from the respective literature (Table 1 in supplementary material). The dataset consists of twenty-eight 2S, twelve 3SMI and seven 3SDI structures. Table 1 (see supplementary material) also provides structural parameters for each structure: monomer length (ML), interface area (B/2) and interface to total residue (I/T) ratio. The structural features in the dataset are summarized in Table 2 (see supplementary material).

Monomer length (ML)

Monomer length (ML) refers to the length, in residues, of the monomers forming the homodimer complex. The distribution of 2S, 3SMI and 3SDI with ML is shown in Figure 1a. The figure illustrates the minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset. The ML of 2S proteins ranges from 45 to 271, that of 3SMI from 72 to 381 and that of 3SDI from 90 to 835. There is some degree of ML overlap between the three categories of homodimers.

Figure 1

Distribution of 2S, 3SMI and 3SDI for ML, B/2 and I/T is shown. The overlap regions are shown horizontally. It should be noted that there is no Y-axis variable defined in this case. However, a Y-axis is shown for convenience of visualization.

(a) An illustration of the minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented. The X-axis represents monomer length. 2S proteins range from 45 to 271, 3SMI range from 72 to 381 and 3SDI range from 90 to 835.

(b) An illustration of the minimum and maximum limits of B/2 for 2S, 3SMI and 3SDI homodimers in the dataset is presented. The X-axis represents interface area. 2S proteins range from 156 to 2507, 3SMI range from 309 to 2332 and 3SDI range from 1351 to 2317.

(c) Distribution of 2S, 3SMI and 3SDI for I/T ratio. 2S proteins range from 6 to 80, 3SMI range from 9 to 44 and 3SDI range from 5 to 50.

Interface area (B/2)

Interface area (B/2) is defined as the change in accessible surface area (ΔASA) on going from the monomer state to the dimer state during complex formation. Accessible surface area (ASA) is calculated using the software SURFACE RACER 5.0 [56] using the algorithm described by Lee and Richards [56]. The distribution of 2S, 3SMI and 3SDI with B/2 is shown in Figure 1b. The figure shows the graphical representation of homodimers according to their interface area. 2S proteins have B/2 in the range of 156 to 2507 Å², 3SMI proteins in the range of 309 to 2317 Å² and 3SDI dimers in the range of 1351 to 2317 Å².
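The definition above can be sketched as a short calculation. This is a minimal illustration, not the authors' code: it assumes the common convention that the total buried area B is the sum of the ASA of the two isolated chains minus the ASA of the dimer, so that B/2 is the buried area per monomer. The function name and the ASA values are hypothetical; in practice the ASA values would come from a tool such as SURFACE RACER.

```python
def interface_area_per_monomer(asa_chain_a: float, asa_chain_b: float,
                               asa_dimer: float) -> float:
    """Return B/2, the buried surface area per monomer, in square angstroms.

    Assumes B = ASA(chain A alone) + ASA(chain B alone) - ASA(dimer).
    """
    buried_total = asa_chain_a + asa_chain_b - asa_dimer  # B
    if buried_total < 0:
        raise ValueError("dimer ASA exceeds the sum of the isolated-chain ASAs")
    return buried_total / 2.0


# Made-up ASA values (square angstroms), for illustration only.
b_half = interface_area_per_monomer(6000.0, 6000.0, 10000.0)
print(b_half)  # 1000.0
```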

Interface to total residue (I/T) ratio

The I/T ratio is the ratio of the number of interface residues per monomer (residues involved in homodimer interactions at the interface) to the total number of residues in the monomer. Interface residues are identified using the ASA calculation described in the previous section. The distribution of 2S, 3SMI and 3SDI with I/T ratio is shown in Figure 1c. The figure shows the graphical representation of homodimers according to their I/T ratio. Here, 3SDI proteins lie in the range of 5 to 50%, 3SMI in the range of 9 to 44% and 2S proteins in the range of 6 to 80%.
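As a rough sketch of this ratio, the snippet below counts a residue as an interface residue when its ASA drops on dimerization and reports I/T as a percentage. The `min_delta` threshold, the function name and the per-residue ASA values are assumptions made for illustration; the text does not state the cut-off used to call a residue an interface residue.

```python
def interface_to_total_ratio(asa_monomer: list[float],
                             asa_in_dimer: list[float],
                             min_delta: float = 0.1) -> float:
    """Return the I/T ratio (%) for one monomer.

    A residue counts as an interface residue when its ASA falls by more
    than min_delta (an assumed threshold) between monomer and dimer.
    """
    if not asa_monomer or len(asa_monomer) != len(asa_in_dimer):
        raise ValueError("need two non-empty ASA lists of equal length")
    interface = sum(1 for free, bound in zip(asa_monomer, asa_in_dimer)
                    if free - bound > min_delta)
    return 100.0 * interface / len(asa_monomer)


# Made-up per-residue ASA values for a four-residue toy monomer.
free = [50.0, 80.0, 10.0, 120.0]
bound = [50.0, 20.0, 10.0, 119.95]
ratio = interface_to_total_ratio(free, bound)
print(ratio)  # 25.0 (only the second residue loses more than 0.1)
```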

Decision tree model

A decision tree model is a clear logical model that can be easily understood by readers who are not mathematically inclined. Here it is a classification tree that assigns the target variable (folding mechanism) based on the predictor variables (ML, B/2 and I/T) described in the previous sections. The cumulative frequencies of the three predictors were used to decide the values in the logical conditions of the decision tree. A flowchart describing the decision tree model is illustrated in Figure 3. The model checks ML, I/T and B/2 for each known homodimer structure and assigns its folding mechanism using cut-off values chosen by human experts, as shown in Figure 3.

Figure 3

A flowchart describing the decision tree model is given. The model checks, one variable after another, whether each predictor value falls within its defined conditional value, and follows the matching branches until it reaches the node that assigns the target variable.
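The kind of rule cascade such a flowchart encodes can be sketched in code. The sketch below is hypothetical: the three cut-offs (ML ≦ 250, I/T ≦ 25% and B/2 ≦ 1500) are the ones reported in the Discussion, but the order and wiring of the branches are assumptions made for illustration and are not taken from Figure 3. The function name and the "unclassified" label are also illustrative.

```python
ML_CUTOFF = 250      # residues, cut-off reported in the Discussion
IT_CUTOFF = 25.0     # percent, cut-off reported in the Discussion
B2_CUTOFF = 1500.0   # square angstroms, cut-off reported in the Discussion


def assign_mechanism(ml: int, it_percent: float, b2: float) -> str:
    """Assign 2S, 3SMI, 3SDI or 'unclassified' from three predictors.

    The branch wiring is an assumed illustration, not the published tree.
    """
    if ml <= ML_CUTOFF and it_percent > IT_CUTOFF:
        return "2S"        # small monomer with a large I/T ratio
    if it_percent <= IT_CUTOFF:
        if ml > ML_CUTOFF and b2 > B2_CUTOFF:
            return "3SDI"  # large monomer, small I/T ratio, large interface
        if b2 <= B2_CUTOFF:
            return "3SMI"  # small I/T ratio with a smaller interface
    return "unclassified"  # no rule matched


# Made-up predictor values, for illustration only.
print(assign_mechanism(100, 40.0, 800.0))   # 2S
print(assign_mechanism(400, 15.0, 2000.0))  # 3SDI
print(assign_mechanism(200, 15.0, 2000.0))  # unclassified
```

Keeping an explicit "unclassified" outcome mirrors the fact that a rule-based tree need not assign every structure to a class.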

Validation

An internal cross validation is performed for the 47 homodimers in Table 1 using the decision tree model described above. The results of the validation, expressed as true positives (TP), false positives (FP) and positive predictive value (PPV), are given in Table 5. PPV (%) is defined as TP/(TP+FP)*100.
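The PPV definition above translates directly into a small helper. The function name and the counts in the example are hypothetical and are not taken from Table 5.

```python
def ppv_percent(tp: int, fp: int) -> float:
    """Positive predictive value in percent: TP / (TP + FP) * 100."""
    if tp < 0 or fp < 0:
        raise ValueError("counts must be non-negative")
    if tp + fp == 0:
        raise ValueError("PPV is undefined when nothing is predicted positive")
    return 100.0 * tp / (tp + fp)


# Made-up counts: 3 correct and 1 incorrect assignment to a class.
print(ppv_percent(3, 1))  # 75.0
```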

Assignment dataset

We created a dataset of 149 homodimers with unknown folding information for prediction and assignment of folding mechanism using structural parameters (Table 3 in supplementary material). The structural features in the dataset are summarized in Table 4 (see supplementary material). A classification of 149 homodimers into three target categories using the decision tree model is given in Table 6 (see supplementary material).

Discussion

Protein homodimer molecules have been defined as drug targets in cancer [48-49]. Thus, homodimers have commercial importance in drug discovery. The different folding mechanisms associated with homodimers are of considerable interest. Homodimer denaturation experiments using fluorescence [3,4,8,13-15,19,21-27,30-43,45,46], circular dichroism [2,3,5-12,14,20,26,27,29,31-40,43,44], NMR [18] and adsorption [38] have been used to establish the folding mechanism (2S, 3SMI, 3SDI) for the homodimers listed in Table 1 (see supplementary material). Such experiments are time consuming and laborious. The number of homodimer structures with unknown folding mechanism is substantial [1]. Therefore, it is of interest to predict homodimer folding mechanism from three-dimensional structure. A number of studies have related folding to structural features [50-54]. We recently described the trends in parameters (monomer size, interface residues, interface area, hydrophobicity factor, hydrophilic residues and charged residues) for distinguishing 2S from 3S proteins [55]. However, no attempt has been made to predict folding mechanism from structures in the complex state. Here, we describe a novel decision tree model using the predictors ML, B/2 and I/T to predict folding mechanism (the target variable) from structures in the complex state.

The decision tree model is developed based on how strongly each of these predictors is weighted towards particular folding classes in a dataset of structures with known folding data (Figure 1). The percent cumulative frequency of each predictor variable in the datasets is given in Figure 2. Figure 2a gives the percent cumulative frequency of 2S, 3SMI and 3SDI for ML. More than 90% of 2S have ML ≦ 250, whereas only about 15% of 3SDI and 60% of 3SMI are covered at this cut-off. Hence, ML ≦ 250 was selected as a decision condition in the development of the model. Figure 2b gives the percent cumulative frequency of 2S, 3SMI and 3SDI for I/T ratio. About 90% of 3SMI and 3SDI have I/T ≦ 25%, whereas only about 30% of 2S are covered at this cut-off. Therefore, I/T ≦ 25% was selected as a decision condition in the development of the model. Figure 2c gives the percent cumulative frequency of 2S, 3SMI and 3SDI for B/2. When B/2 ≦ 1500, about 70% of 3SMI, 50% of 2S and 30% of 3SDI are covered. So, B/2 ≦ 1500 was selected as a decision condition in the development of the model.

Thus, percent cumulative frequency values for the predictors are used in the design and development of the decision tree model (Figure 3). The conditional values of the predictor variables are selected where the cumulative frequency differs most between the target categories (datasets). The decision tree model checks, one variable after another, whether each predictor value falls within its defined conditional value, and follows the matching branches until it reaches the node that assigns the target variable.
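The percent cumulative frequency used to pick these cut-offs is simply the share of a class whose predictor value lies at or below a candidate value. The sketch below shows the calculation on made-up monomer lengths; the function name and the data are hypothetical and do not reproduce Table 1.

```python
def percent_at_or_below(values: list[float], cutoff: float) -> float:
    """Percent of values that are less than or equal to cutoff."""
    if not values:
        raise ValueError("values must not be empty")
    covered = sum(1 for v in values if v <= cutoff)
    return 100.0 * covered / len(values)


# Made-up monomer lengths per folding class, for illustration only.
ml_by_class = {
    "2S": [100, 150, 200, 300],
    "3SDI": [200, 300, 400, 500],
}
coverage = {name: percent_at_or_below(ml, 250) for name, ml in ml_by_class.items()}
print(coverage)  # {'2S': 75.0, '3SDI': 25.0}
```

A cut-off is informative when it covers most of one class and little of another, which is the pattern described for ML ≦ 250 and I/T ≦ 25%.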

Figure 2

Percent cumulative frequency of 2S, 3SMI and 3SDI for ML, I/T and B/2 is given.

(a) The distribution of the cumulative frequency of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented. About 90% of 2S, 60% of 3SMI and 15% of 3SDI are covered when ML ≦ 250. Hence, ML ≦ 250 was selected as a decision condition in the development of the model.

(b) The distribution of the cumulative frequency of I/T ratio for 2S, 3SMI and 3SDI homodimers in the dataset is presented. About 30% of 2S and 90% of 3SMI and 3SDI are covered when I/T ≦ 25%. Hence, I/T ≦ 25% was selected as a decision condition in the development of the model.

(c) The distribution of the cumulative frequency of interface area for 2S, 3SMI and 3SDI homodimers in the dataset is presented. About 50% of 2S, 70% of 3SMI and 30% of 3SDI are covered when B/2 ≦ 1500. Hence, B/2 ≦ 1500 was selected as a decision condition in the development of the model.

The decision tree model was applied to classify the dataset of 47 homodimers (with known folding data) in a cross validation experiment. The model produced positive predictive values (PPV) of 71.4%, 58.4% and 57.1% for 2S, 3SMI and 3SDI, respectively (Table 5 in supplementary material). We then extended the application of the decision tree model to a dataset of 149 homodimers with no known folding data. The model assigned a predicted folding mechanism to 132 (88.5%) of the 149 structures, and only 17 structures could not be classified (Table 6 in supplementary material). These predictions serve as a framework for understanding the folding mechanism of these homodimers from their structures. It should be noted that the predicted mechanisms should be verified using denaturation experiments.

Conclusion

It was of interest to predict and classify homodimer folding mechanism from structures in the complex state. A novel decision tree model is described that uses structural features (ML, B/2, I/T) derived from known structures to assign folding mechanism to homodimers. In cross validation, the decision tree model classified homodimers into their respective groups with positive predictive values (PPV) of 72% for 2S, 58% for 3SMI and 57% for 3SDI. Thus, the method finds application in grouping protein homodimer structures with unknown folding data. A number of homodimer structures with unknown folding information are available in PDB. We applied the model to a set of 149 homodimers with unknown folding data. The model classified 132 (88.5% of 149) homodimers into 2S (39), 3SMI (61) and 3SDI (32). Consequently, a framework is established for these 132 known structures with predicted folding data for further experimental verification and confirmation.

55 in total

1. Differential scanning calorimetry of thermal unfolding of the methionine repressor protein (MetJ) from Escherichia coli.

Authors: C M Johnson; A Cooper; P G Stockley
Journal: Biochemistry Date: 1992-10-13 Impact factor: 3.162

2. Dissociation and unfolding of Pi-class glutathione transferase. Evidence for a monomeric inactive intermediate.

Authors: A Aceto; A M Caccuri; P Sacchetta; T Bucciarelli; B Dragani; N Rosato; G Federici; C Di Ilio
Journal: Biochem J Date: 1992-07-01 Impact factor: 3.857

3. Equilibrium unfolding of class pi glutathione S-transferase.

Authors: H W Dirr; P Reinemer
Journal: Biochem Biophys Res Commun Date: 1991-10-15 Impact factor: 3.575

4. Thermodynamic characterization of the conformational stability of the homodimeric protein, pea lectin.

Authors: N Ahmad; V R Srinivas; G B Reddy; A Surolia
Journal: Biochemistry Date: 1998-11-24 Impact factor: 3.162

5. Dimeric tyrosyl-tRNA synthetase from Bacillus stearothermophilus unfolds through a monomeric intermediate. A quantitative analysis under equilibrium conditions.

Authors: Y C Park; H Bedouelle
Journal: J Biol Chem Date: 1998-07-17 Impact factor: 5.157

6. Urea-induced unfolding and conformational stability of 3-isopropylmalate dehydrogenase from the Thermophile thermus thermophilus and its mesophilic counterpart from Escherichia coli.

Authors: C Motono; A Yamagishi; T Oshima
Journal: Biochemistry Date: 1999-01-26 Impact factor: 3.162