Rajkumar Bondugula, Michael S Lee, Anders Wallqvist.
Abstract
Protein domain prediction is often the preliminary step in both experimental and computational protein research. Here we present a new method to predict the domain boundaries of a multidomain protein from its amino acid sequence using a fuzzy mean operator. Using the nr-sequence database together with a reference protein set (RPS) containing known domain boundaries, the operator is used to assign a likelihood value for each residue of the query sequence as belonging to a domain boundary. This procedure robustly identifies contiguous boundary regions. For a dataset with a maximum sequence identity of 30%, the average domain prediction accuracy of our method is 97% for one-domain proteins and 58% for multidomain proteins. The presented model is capable of using new sequence/structure information without re-parameterization after each RPS update. When tested on a current database using a four-year-old RPS and on a database that contains different domain definitions than those used to train the models, our method consistently yielded the same accuracy while two other published methods did not. A comparison with other domain prediction methods used in the CASP7 competition indicates that our method performs better than existing sequence-based methods.
Year: 2008 PMID: 19056827 PMCID: PMC2632928 DOI: 10.1093/nar/gkn944
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Domain composition of proteins contained in the SCOP databases used in this work. Columns give the SCOP database version, with the maximum percentage sequence identity in parentheses.
| Number of domains | 1.65 (30%) | 1.69 (20%) | 1.69 (30%) | 1.69 (40%) | 1.73 (30%) | 1.73 (95%) |
|---|---|---|---|---|---|---|
| One | 3145 | 3449 | 4153 | 4724 | 5432 | 10 303 |
| Two | 533 | 494 | 627 | 789 | 826 | 1653 |
| Three | 107 | 96 | 123 | 157 | 148 | 267 |
| Four | 20 | 9 | 21 | 25 | 25 | 66 |
| Total | 3805 | 4048 | 4924 | 5695 | 6431 | 12 289 |
Data in the first row indicate the number of one-domain proteins in each database. The second row contains the number of two-domain proteins, etc. The last row indicates the total number of proteins included in each database.
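As a quick consistency check, the totals in the last row can be recomputed from the per-domain counts, and the share of multidomain proteins in each database derived from them. The counts below are copied from the table; the names `scop_counts` and `summarize` are illustrative only.

```python
# Counts copied from the table, in the order:
# one-, two-, three- and four-domain proteins.
scop_counts = {
    "1.65 (30%)": [3145, 533, 107, 20],
    "1.69 (20%)": [3449, 494, 96, 9],
    "1.69 (30%)": [4153, 627, 123, 21],
    "1.69 (40%)": [4724, 789, 157, 25],
    "1.73 (30%)": [5432, 826, 148, 25],
    "1.73 (95%)": [10303, 1653, 267, 66],
}


def summarize(counts):
    """Return (total proteins, fraction that are multidomain)."""
    total = sum(counts)
    multidomain = total - counts[0]  # everything except one-domain proteins
    return total, multidomain / total


for version, counts in scop_counts.items():
    total, fraction = summarize(counts)
    print(f"SCOP {version}: total={total}, multidomain={fraction:.1%}")
```

The recomputed totals match the last row of the table for every database, and in each case multidomain proteins make up a minority of the set.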
Figure 1. Fragments retrieved when the RPS is searched for fragments matching a typical protein. The fragments shown are labeled using their SCOP definitions. Residues labeled ‘D’ lie in protein domains, whereas residues labeled ‘B’ lie on a domain boundary; ‘–’ indicates that no residue in the current fragment is aligned with the query sequence. For the alanine residue (A) in the shaded box, the domain boundary propensity is calculated using Equation 2 from the five aligned residues (K = 5), four of which are found in non-boundary regions and one in a boundary region. The importance of these contributions is inversely weighted by their respective scores, S, shown on the right, as detailed in Equation 2. In this case, the likelihood P that the alanine residue belongs to a domain boundary is 0.0804.
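The weighting idea in this caption can be illustrated with a minimal sketch. Equation 2 is not reproduced in this section, so the code below is not the authors' operator: it assumes each aligned residue contributes a membership of 1 (‘B’) or 0 (‘D’) and that the weight is simply 1/S. The function name and the scores in the example are hypothetical, so the result is not expected to equal the 0.0804 quoted in the caption.

```python
def boundary_likelihood(labels, scores):
    """Weighted mean of boundary memberships for one query residue.

    labels: 'B' (boundary) or 'D' (domain) for each aligned fragment residue.
    scores: positive alignment scores S, one per fragment.
    ASSUMPTION: the weight is 1/S; the paper's Equation 2 may differ.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if not labels:
        raise ValueError("at least one aligned residue is required")
    weights = [1.0 / s for s in scores]
    memberships = [1.0 if label == "B" else 0.0 for label in labels]
    weighted_sum = sum(w * m for w, m in zip(weights, memberships))
    return weighted_sum / sum(weights)


# K = 5 aligned residues: four in domains, one on a boundary.
# The scores are made-up values chosen only for illustration.
labels = ["D", "D", "D", "D", "B"]
scores = [1.0, 2.0, 4.0, 4.0, 8.0]
print(boundary_likelihood(labels, scores))
```

Under this assumed weighting, a single boundary residue with a large score contributes little to P. That is consistent with the low likelihood reported in the caption, although the exact value depends on the form of Equation 2.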
Figure 2. The predicted raw domain boundary propensity (solid line) of the Escherichia coli MurF enzyme, PDB code 1GG4, chain A. Two regions that potentially contain domain boundaries are identified. The post-processing results in two predicted boundaries centered on residues 91 and 314 (dotted lines), whereas the true boundaries are centered on residues 98 and 313 (data not shown). The background noise that gets filtered out during the post-processing can be seen at the COOH- and NH2-terminal ends of the sequence.
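One simple way to turn a per-residue propensity profile into boundary positions is to threshold it and report the center of each contiguous run above the threshold. The sketch below does only that. The post-processing actually used is not described in this section, so the function name, the `min_len` noise filter and the default threshold are assumptions made for illustration.

```python
def boundary_centers(propensity, threshold=0.4, min_len=1):
    """Return 1-based center residues of contiguous runs >= threshold.

    ASSUMPTION: runs shorter than min_len are treated as noise and dropped;
    the paper's real post-processing may use different rules.
    """
    centers = []
    start = None  # 0-based index where the current run began
    for i, p in enumerate(propensity):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                centers.append((start + i - 1) // 2 + 1)
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(propensity) - start >= min_len:
        centers.append((start + len(propensity) - 1) // 2 + 1)
    return centers


# Toy profile with one clear peak spanning residues 4 to 6.
profile = [0.1, 0.1, 0.1, 0.6, 0.7, 0.6, 0.1, 0.1]
print(boundary_centers(profile))
```

Raising `min_len` discards short isolated spikes, which is one plausible way to suppress the kind of terminal background noise mentioned in the caption.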
Figure 3. The effect of threshold on the performance of FIEFDom for the SCOP 1.73 (30%) dataset. (a) Receiver operating characteristic (ROC) curve averaged over all of the domain sets, plotted as the threshold (T) is varied from 0 to 1 in intervals of 0.1. (b) One-domain (blue solid line), two-domain (pink dashed line), three-domain (black dotted line), four-domain (red dashed-dotted line) and the average domain boundary prediction accuracy, plotted as a function of the threshold value, T. Because the accuracy values reach their maximum and vary slowly over a range of T values, we selected T = 0.4 as the appropriate value to be used in our model.
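A threshold sweep of the kind described in panel (a) can be sketched as follows. This is a generic ROC calculation, not the authors' evaluation code: it assumes per-residue scores with true/false boundary labels, whereas the paper may count hits per boundary or per protein. The function name and the toy data are hypothetical.

```python
def roc_points(scores, is_boundary, thresholds=None):
    """Return (threshold, false positive rate, true positive rate) tuples.

    ASSUMPTION: an item is predicted positive when its score >= threshold.
    """
    if thresholds is None:
        thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    positives = sum(1 for y in is_boundary if y)
    negatives = len(is_boundary) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, is_boundary) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, is_boundary) if not y and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


# Toy data: two boundary residues with high scores, two non-boundary residues.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [True, True, False, False]
for t, fpr, tpr in roc_points(scores, labels):
    print(f"T={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting TPR against FPR for the eleven thresholds gives a curve of the kind shown in panel (a).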
Studying the effect of homolog availability for building profiles, the number of proteins in the RPS and the effect of maximum sequence identity among the sequences in the RPS on the performance of FIEFDom
| Database | Alignment | One | Two | | | Three | | | Four | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Homolog availability | | | | | | | | | | | |
| SCOP 1.73 (30%) | | 97 | 88 | 60 | 55 | 95 | 61 | 59 | 90 | 63 | 59 |
| SCOP 1.73 (30%) | | 99 | 95 | 40 | 39 | 94 | 41 | 40 | 86 | 62 | 57 |
| Number of proteins in RPS | | | | | | | | | | | |
| SCOP 1.65 (30%) | | 97 | 86 | 54 | 50 | 96 | 58 | 57 | 93 | 45 | 44 |
| SCOP 1.69 (30%) | | 97 | 90 | 57 | 54 | 93 | 58 | 56 | 91 | 49 | 47 |
| SCOP 1.73 (30%) | | 97 | 88 | 60 | 55 | 95 | 61 | 59 | 90 | 63 | 59 |
| Maximum sequence identity in RPS | | | | | | | | | | | |
| SCOP 1.69 (20%) | | 97 | 86 | 43 | 41 | 90 | 42 | 40 | 71 | 19 | 17 |
| SCOP 1.69 (30%) | | 97 | 90 | 57 | 54 | 93 | 58 | 56 | 91 | 49 | 47 |
| SCOP 1.69 (40%) | | 97 | 91 | 67 | 63 | 92 | 66 | 62 | 93 | 56 | 54 |
A, accuracy; Sp, specificity; Sn, sensitivity. Alignment: PS, profile-sequence; SS, sequence-sequence. All values are percentages. Top: The availability of homology information for query sequences is simulated by using either the query profile (profile-sequence, consistent with high availability) or the query sequence itself (sequence-sequence, consistent with low availability) to search for identical fragments in the RPS. For multidomain proteins, the profile-sequence alignment yields on average 13% higher overall accuracy than the sequence-sequence alignment. Middle: Every other version of the SCOP database, with 30% maximum sequence identity among the proteins, is used to study the effect of the number of proteins in the RPS. The larger the RPS (see Table 1 for the detailed breakdown in number of proteins and domain compositions), the higher the average domain boundary prediction accuracy for multidomain proteins, presumably because of the additional structure/sequence information uncovered as novel structures are added to the database. Bottom: Three simulations were conducted using databases with three different maximum sequence identities among the reference proteins, ranging from 20% to 40%.
Figure 4. (a) One-domain (red dashed line), two-domain (blue dashed-dotted line), three-domain (green dotted line), four-domain (solid magenta line) and average (bold solid black line) domain prediction accuracies are plotted as a function of database version. As time progresses, new information can be added to the prediction algorithm by updating the RPS. As the number of sequences in the database increases, the prediction accuracy improves. (b) The same domain prediction accuracies as in (a) are plotted as a function of maximum sequence identity cutoff in the RPS. More structural information is added to the prediction system by increasing the maximum sequence identity among proteins in the RPS.
The performance metrics of the three programs on a dataset that is about four years further in time from the training or reference data
| Method | One | Two | | | Three | | | Four | | |
|---|---|---|---|---|---|---|---|---|---|---|
| FIEFDom | 97 | 93 | 77 | 73 | 96 | 85 | 82 | 94 | 88 | 84 |
| PPRODO | 56 | 53 | 54 | 37 | 50 | 38 | 28 | 78 | 51 | 44 |
| DOMpro | 80 | 32 | 12 | 10 | 34 | 14 | 11 | 55 | 23 | 19 |
| FIEFDom (only two-domains) | 91 | 94 | 73 | 70 | 80 | 39 | 36 | 90 | 35 | 33 |
| FIEFDom (only multidomains) | 89 | 91 | 76 | 71 | 95 | 86 | 82 | 96 | 88 | 85 |
All values are percentages. Column groups give the number of domains. Five prediction sets were generated to understand how FIEFDom (with three versions of the same RPS), PPRODO and DOMpro perform on the SCOP 1.73 (30%) database. The first row shows the performance of FIEFDom that uses the SCOP 1.65 (30%) database as the RPS. The second and third rows show the performance of PPRODO and DOMpro, respectively. The fourth and fifth rows show the performance of FIEFDom that uses an RPS containing only two-domain proteins or only multidomain proteins, respectively.
The performance metrics of the three programs on a dataset that uses domain definitions derived from the CATH database
| Method | One | Multi | | |
|---|---|---|---|---|
| FIEFDom | 92 | 91 | 65 | 61 |
| PPRODO | 90 | 58 | 51 | 37 |
| DOMpro | 91 | 58 | 21 | 18 |
| FIEFDom (only two domains) | 89 | 91 | 50 | 48 |
| FIEFDom (only multidomain) | 89 | 91 | 62 | 58 |
All values are percentages. Column groups give the number of domains. Five prediction sets were generated to understand how FIEFDom (with three versions of the same RPS), PPRODO and DOMpro perform on a database that derives its domain definitions from the CATH database (version 2.5.1). The results for two-, three- and four-domain proteins have been averaged and are shown under ‘Multi’. The first row shows the performance of FIEFDom that uses the SCOP 1.65 (30%) database as the RPS. The second and third rows show the performance of PPRODO and DOMpro, respectively. The fourth and fifth rows show the performance of FIEFDom that uses an RPS containing only two-domain proteins or only multidomain proteins, respectively.
The performance of various sequence-based domain prediction methods on the 97 (70 one-domain proteins and 27 multidomain proteins) CASP7 targets
| Methods | Domain number: One | Domain number: Multi | | | Domain number: Combined | Domain position: Multi | |
|---|---|---|---|---|---|---|---|
| FIEFDom | 100 | 88.9 | 30.8 | 29.6 | 64.8 | 6 | 2 |
| CHOP ( | 55.8 | 37.5 | 42.9 | 25.0 | 40.4 | 4 | 4 |
| DomSSEA ( | 92.9 | 100 | 30.8 | 30.8 | 61.8 | 4 | 4 |
| DPS | 80.5 | 100 | 42.3 | 42.3 | 61.4 | 5 | 2 |
| HHPred1 | 95.6 | 100 | 25.9 | 25.9 | 60.8 | 4 | 3 |
| HHPred3 | 95.7 | 100 | 25.9 | 25.9 | 60.8 | 4 | 3 |
| NNPutLab | 78.5 | 80.0 | 15.4 | 14.8 | 46.6 | 2 | 3 |
All values under the domain number prediction are percentages. Sequence-based domain prediction methods that were used in CASP7 are listed on the left. For one-domain number prediction, the accuracy (A) is listed. For multidomain number prediction, accuracy (A), specificity (Sp) and sensitivity (Sn) are listed. The domain number prediction accuracy for all targets in the CASP7 set is listed under the ‘Combined’ heading. For the domain position prediction of multidomain proteins, the actual counts of proteins whose domain boundaries are predicted completely correctly and partially correctly are listed.
a: http://predictioncenter.org/casp7/meeting_docs/abstractsd.pdf.