The formation of protein homodimer complexes for molecular catalysis and regulation is fascinating. The homodimer formation through 2S (2 state), 3SMI (3 state with monomer intermediate) and 3SDI (3 state with dimer intermediate) folding mechanism is known for 47 homodimer structures. Our dataset of forty-seven homodimers consists of twenty-eight 2S, twelve 3SMI and seven 3SDI. The dataset is characterized using monomer length, interface area and interface/total (I/T) residue ratio. It is found that 2S are often small in size with large I/T ratio and 3SDI are frequently large in size with small I/T ratio. Nonetheless, 3SMI have a mixture of these features. Hence, we used these parameters to develop a decision tree model. The decision tree model produced positive predictive values (PPV) of 72% for 2S, 58% for 3SMI and 57% for 3SDI in cross validation. Thus, the method finds application in assigning homodimers with folding mechanism.
The formation of protein homodimer complexes for molecular catalysis and regulation is fascinating. The homodimer formation through 2S (2 state), 3SMI (3 state with monomer intermediate) and 3SDI (3 state with dimer intermediate) folding mechanism is known for 47 homodimer structures. Our dataset of forty-seven homodimers consists of twenty-eight 2S, twelve 3SMI and seven 3SDI. The dataset is characterized using monomer length, interface area and interface/total (I/T) residue ratio. It is found that 2S are often small in size with large I/T ratio and 3SDI are frequently large in size with small I/T ratio. Nonetheless, 3SMI have a mixture of these features. Hence, we used these parameters to develop a decision tree model. The decision tree model produced positive predictive values (PPV) of 72% for 2S, 58% for 3SMI and 57% for 3SDI in cross validation. Thus, the method finds application in assigning homodimers with folding mechanism.
Homodimers play an important role in catalysis and regulation. The
formation of homodimer interface is structurally intriguing [1]. The
mechanism of formation of such homodimer interfaces is further
appealing. Structures for 47 homodimers with known folding
information are now available as given in Table 1 (supplementary material)
[2-46]. These homodimers are formed through 2-sate (2S)
[2-28], or 3-state with monomer intermediate (3SMI)
[3-46]or 3-state with dimer intermediate (3SDI)
[29-35]. A couple of
homodimers have been described as cancer targets
[47,48,49].
Hence, the future definition of homodimers as drug targets is
evident. Therefore, it is important to understand both homodimer
association and its folding mechanism of formation. A number of
attempts have been made to relate homodimer structures to folding
mechanism to decipher folding specific structural features [50-54].
We recently documented the relationship between structural
features describing homodimer folding mechanism [55].
Nevertheless, folding information on homodimers is far less than
the known number of homodimer structures stored in databases [1].
Therefore, it is of interest to predict folding mechanism to known
homodimer structures. We created an improved dataset of 47
homodimer structures from PDB with known folding mechanism to
glean parameters and to develop models for homodimer folding
mechanism prediction given their structures. We then use these
parameters to design a decision tree model to classify homodimer
structures with unknown folding mechanism.
Methodology
Dataset
We created a dataset of 47 homodimer structures from PDB with
known folding information taken from respective literature (Table 1
in supplementary material). The dataset consists of twenty eight
2S, twelve 3SMI and seven 3SDI structures. Table 1 (see
supplementary material) also provides information on structural
parameters such as monomer length (ML), interface area (B/2) and
interface to total residue (I/T) ratio for each structure. The structural
features in the dataset are summarized in Table 2 (see
supplementary material).
Monomer length (ML)
Monomer length (ML) refers to the protein length of monomers
forming the homodimer complex. The distribution of 2S, 3SMI and
3SDI with ML is shown in Figure 1a. The figure illustrates the
minimum and maximum limits of ML for 2S, 3SMI and 3SDI
homodimers in the dataset. The length of 2S proteins are found in
the range of 45 to 271, 3SMI in the range of 72 and 381, while 3SDI
between 90 and 835. There is some degree of ML overlap between
the three categories of homodimers.
Figure 1
Distribution of 2S, 3SMI and 3SDI for ML, B/2 and I/T is shown.
The overlap regions are shown horizontally. 2S proteins
range from 6 to 80, 3SMI range from 9 to 44 and 3SDI range from 5 to 50. It should be noted that there is no Y-axis variable defined in this
case. However, a Y-axis is shown for convenience of visualization.
An illustration of the minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented. The X ‐ axis represents monomer length.
The overlap regions are shown horizontally. 2S proteins range from 45 to 271, 3SMI range from 72 to 381 and 3SDI range from 90 to 835.
An illustration of the
minimum and maximum limits of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented. The X axis represents interface area. The overlap regions are shown horizontally. 2S
proteins range from 156 to 2507, 3SMI range from 309 to 2332 and 3SDI range from 1351 to 2317.
Distribution of 2S, 3SMI and 3SDI for I/T ratio.
Interface area (B/2)
Interface area (B/2) is defined as the change in accessible surface
area (delta ASA) when going from monomer state to dimer state during complex formation. Accessible surface area (ASA) is
calculated using the software SURFACE RACER 5.0 [56] using the
algorithm described by Lee and Richard [56]. The distribution of
2S, 3SMI and 3SDI with B/2 is shown in Figure 1b. The figure
shows the graphical representation of homodimers according to their interface area. 2S proteins have B/2 range between
156 -2507 Å2 and 3SMI proteins range within 309 and 2317 Å2. However,3SDI dimers lie between 1351 and 2317 Å2.
Interface to total residue (I/T) ratio
It is the ratio between the numbers of interface residues per
monomer (residues involved in homodimer interactions at the
interface) to the total number of residues in monomer protein.
Interface residues are identified using ASA calculation described in
previous section. The distribution of 2S, 3SMI and 3SDI with I/T
ratio is shown in Figure 1c. The figure shows the graphical
representation of homodimers to I/T ratio. Here, the 3SDI proteins
lie in the range of 5 to 50%, and 3SMI in the range of 9 to 44%,
while the 2S proteins lie in the range of 6 to 80%.
Decision tree model
A decision model is a clear logical model that can be easily
understood by persons who are not mathematically inclined. The
decision tree model is a classification tree to classify the target
variable (folding mechanism in this case) based on the predictor
variables (ML, B/2 and I/T) described in previous sections. The
cumulative frequencies of the three predictors (ML, B/2 and I/T)
were used to decide the values in the logical conditions of the
decision tree. A flowchart describing the decision tree model is
illustrated in Figure 3. The model checks for ML, I/T and B/2 for
each known homodimer structures to assign their folding
mechanism using human expert cut-off values as shown in Figure 3.
Figure 3
A flowchart describing the decision tree model is given. The decision tree model checks for predictor values within defined
conditional values for multiple variables in a subsequent manner sequentially so as to reach the respective nodes to predict and assign target
variables.
Validation
An internal cross validation is performed for 47 homodimers in
Table 1 using the decision tree model described above. The results
of the validation using true positive (TP), false positive (FP) and
positive predictive value (PPV) is given in Table 5. PPV (%) is
defined as TP/(TP+FP)*100.
Assignment dataset
We created a dataset of 149 homodimers with unknown folding
information for prediction and assignment of folding mechanism
using structural parameters (Table 3 in supplementary material).
The structural features in the dataset are summarized in Table 4
(see supplementary material). A classification of 149 homodimers
into three target categories using the decision tree model is given in
Table 6 (see supplementary material).
Discussion
Protein homodimer molecules have been defined as drug targets in
cancer [48-49]. Thus, homodimers have commercial importance in
drug discovery. The different folding mechanisms associated with
homodimers are interesting and their study is often attractive.
Homodimer denaturation experiments using fluorescence
[3,4,8,13–15,19,21–27,;30–43>,45>,46]
, circular dichroism
[2,3,5-12,14,20,26,27,29,31-40,43,44],
NMR [18] and adsorption [38] have
been used to establish folding mechanism (2S, 3SMI, 3SDI) for a
list of homodimers given in Table 1 (see supplementary material). This is time consuming, laborious and tedious. The
number of homodimer structures with unknown folding mechanism
is substantial [1]. Therefore, it is of interest to predict homodimer
folding mechanism given their 3dimenisonal structures. A number
of studies have been documented to relate folding and structural
features [50-54]. We recently described the trends in parameters
(monomer size, interface residues, interface area, hydrophobicity
factor, hydrophilic residues and charged residues) for distinguishing
2S from 3S proteins [55]. However, no attempt has been made to
predict their folding mechanism given their structures in complex
state. Here, we describe a novel decision tree model using predictors
ML, B/2 and I/T to predict folding mechanism (target variable)
given their structures in complex state.The decision tree model is developed based on the prevalence of
weight associated with these predictors in a dataset of structures
with known folding data (Figure 1). The distribution of its percent
cumulative frequency of predictor variables in the datasets are given
in Figures 2. Figure 2a gives percent cumulative frequency of 2S,
3SMI and 3SDI for ML. More than 90% of 2S lie when ML ≦ 250.
When ML = 250 only about 15% of 3SDI and 60% of 3SMI are
covered. Hence, ML ≦250 was selected as a decisive condition in
the development of the model. Figure 2b gives percent cumulative
frequency of 2S, 3SMI and 3SDI for I/T ratio. About 90% of 3SMI
and 3SDI lie when I/T ≦ 25%. When I/T ≦ 25%, only about 30%
of 2S is covered. Therefore, I/T ≦25% was selected as a decision
condition in the development of the model. Figure 2c gives percent
cumulative frequency of 2S, 3SMI and 3SDI for B/2. When B/2 ≦
1500, about 70% 3SMI, 50% 2S and 30% 3SDI are covered. So,
B/2 ≦ 1500 was selected as a decision condition in the
development of the model. Thus, percent cumulative frequency
values for predictors are used in the design and development of the
decision tree model (Figure 3). The conditional values of the
predictor variables are selected based on their biased cumulative
frequency in the target categories (datasets). The decision tree
model checks for predictor values within defined conditional values
for multiple variables in a subsequent manner sequentially so as to
reach the respective nodes to predict and assign target variables.
Figure 2
Percent cumulative frequency of 2S, 3SMI and 3SDI for ML, I/T and B/2 is given.
The distribution of the cumulative
frequency of ML for 2S, 3SMI and 3SDI homodimers in the dataset is presented. About 90% of 2S, 60% of 3SMI and 15% of 3SDI are
covered when ML ≦ 250. Hence, ML ≦250 was selected as a decision condition in the development of the model.
The distribution of
the cumulative frequency of I/T ratio for 2S, 3SMI and 3SDI homodimers in the dataset is presented. About 30% of 2S and 90% of 3SMI
and 3SDI are covered when I/T ≦ 25%. Hence, I/T ≦25% was selected as a decision condition in the development of the model.
The
distribution of the cumulative frequency of interface area for 2S, 3SMI and 3SDI homodimers in the dataset is presented. About 50% of 2S,
70% of 3SMI and 30% of 3SDI are covered when B/2 ≦ 1500. Hence, B/2 ≦ 1500 was selected as a decision condition in the
development of the model.
The decision tree model was applied to classify the dataset of 47
homodimers (with known folding data) in a cross validation
experiment. The model produced the positive predictive values
(PPV) 71.4%, 58.4% and 57.1% for 2S, 3SMI and 3SDI,
respectively (Table 5 in supplementary material). We then
extended the application of the decision tree model to a dataset of
149 homodimers with no folding data known. The model was able
to assign folding data to 132 (88.5%) of 149 structures to predicted
target variables with only 17 structures unable to classify (Table 6
in supplementary material). This predicted data serves a
framework to understand their folding mechanism given their
structures. It should be noted that these predicted mechanism should
be verified using denaturation experiments.
Conclusion
It was of interest to predict and classify the homodimer folding
mechanism given their structures in complex state. A novel decision
tree model is described using structural features (ML, B/2, I/T)
derived from known structures to assign folding mechanism for
homodimers given their structures. The decision tree model
correctly classified with positive predictive values (PPV) 72% for
2S, 58% for 3SMI and 57% for 3SDI into their respective groups in
cross validation. Thus, the method finds application in grouping
protein homodimer structures with unknown folding data. A number
of homodimer structures with unknown folding information are
available in PDB. We applied the model to a set of 149 homodimers
with unknown folding data. The model classified 132 (88.5% of
149) homodimers into 2S (39), 3SMI (61) and 3SDI (32).
Consequently, a framework is established for these 132 known
structures with predicted folding data for further experimental
verification and confirmation.
Authors: A Aceto; A M Caccuri; P Sacchetta; T Bucciarelli; B Dragani; N Rosato; G Federici; C Di Ilio Journal: Biochem J Date: 1992-07-01 Impact factor: 3.857