Jesse Eickholt, Xin Deng, Jianlin Cheng.
Abstract
BACKGROUND: Accurate identification of protein domain boundaries is useful for protein structure determination and prediction. However, predicting protein domain boundaries from a sequence is still very challenging and largely unsolved.
Year: 2011 PMID: 21284866 PMCID: PMC3036623 DOI: 10.1186/1471-2105-12-43
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Procedure to identify and extract domain boundary signals. To identify domain boundary signals for a target, homologous sequences are found using PSI-BLAST. The pairwise alignments generated by PSI-BLAST are used to form a multiple sequence alignment (MSA) with the query sequence as the anchor. A domain boundary signal is defined as a gap which begins at the N- or C-terminal end of a sequence in the MSA and extends continuously for at least 45 residues. With the gaps removed, the remaining sequence must be at least 45 residues long for a signal to be generated. Two domain boundary signals are shown here for 1B4A (locations indicated by large arrows).
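The signal definition in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each MSA row has already been projected onto the query sequence (one column per query residue, `-` for gaps), and it assumes a signal is placed at the column where the terminal gap ends; the caption does not state the exact placement. The function name `boundary_signals` is hypothetical.

```python
MIN_GAP = 45       # minimum length of a terminal gap (from the caption)
MIN_ALIGNED = 45   # minimum number of non-gap residues left in the row

def boundary_signals(msa_rows, gap="-"):
    """Return sorted 0-based query columns where a long terminal gap ends.

    Assumption: every row is aligned to the query, so column index
    equals query residue index.
    """
    signals = set()
    for row in msa_rows:
        aligned = len(row) - row.count(gap)
        if aligned < MIN_ALIGNED:
            continue  # too little sequence remains once gaps are removed
        n_gap = len(row) - len(row.lstrip(gap))  # gap run at the N-terminal end
        c_gap = len(row) - len(row.rstrip(gap))  # gap run at the C-terminal end
        if n_gap >= MIN_GAP:
            signals.add(n_gap)                   # first aligned column (assumed placement)
        if c_gap >= MIN_GAP:
            signals.add(len(row) - c_gap - 1)    # last aligned column (assumed placement)
    return sorted(signals)

rows = [
    "-" * 50 + "A" * 60,             # N-terminal gap of 50: signal
    "A" * 60 + "-" * 50,             # C-terminal gap of 50: signal
    "-" * 50 + "A" * 30 + "-" * 30,  # only 30 aligned residues: rejected
    "-" * 10 + "A" * 100,            # gap too short: no signal
]
print(boundary_signals(rows))
```

Collecting positions in a set merges signals that several homologs generate at the same residue, which matches the way Figure 2 reports signals per residue rather than per homolog.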
Figure 2. Domain boundary signal sites for 1CQX. (a) Domain boundary signal site locations which were extracted from a multiple sequence alignment for chain A of protein 1CQX. Signals (denoted by '*') were generated at 28 different residues across this three-domain protein. The true domains and domain boundaries are also indicated (boundaries with an '!'). Note that all domain boundaries have signals nearby, indicating good coverage of the domain boundaries. (b) Structural plot for chain A of protein 1CQX. The locations of domain boundary signals are shown in orange and true domain boundaries in green.
Boundary site signal classification results for Task 1 and Task 2 using both 10-fold cross-validation and leave-one-out cross-validation (LOOCV).
| Classification Task | Overall Acc. (10-Fold Cross-Validation) | Overall Acc. (LOOCV) |
|---|---|---|
| Task 1 (near/away boundary vs. false boundary) | 0.80 | 0.81 |
| Task 2 (away boundary vs. near boundary) | 0.74 | 0.76 |
Figure 3. Domain boundary predictions for 1QQG. (a) True domains and domain boundaries (boundaries indicated by '!') and the predicted domain boundaries (indicated by 'x') for chain A of protein 1QQG, a two-domain protein with a domain linker delineated by '!'. Both domain boundaries are accurately predicted. These predictions were made using a decision threshold of 0.5. (b) Structural plot for chain A of protein 1QQG. The predicted domain boundaries are shaded orange. The linker between the two domains could not be structurally determined (i.e., its coordinates were not available) and is therefore represented by the dashed line.
Figure 4. Domain boundary prediction results on multi-domain proteins. (a) We calculated the precision of domain boundary predictions and the recall of true domain boundaries at varying decision thresholds. The recall value is calculated for domain boundaries which occur at least 40 residues from the N- or C-terminal end of a sequence. A domain boundary prediction is considered correct if it occurs within 20 residues of a true domain boundary. (b) Plot of precision and recall with respect to the decision threshold. The break-even point (precision = recall) is 60%.
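The evaluation criteria in this caption can be sketched in Python as follows. This is an illustrative reading rather than the authors' evaluation script. It assumes 1-based residue positions, measures distance to the termini as `pos - 1` and `seq_len - pos`, and excludes near-terminal true boundaries from recall only. The function name `precision_recall` and its parameters are hypothetical.

```python
def precision_recall(predicted, true_boundaries, seq_len, tol=20, end_margin=40):
    """Precision of predictions and recall of true boundaries for one protein.

    tol: a prediction is correct if within `tol` residues of a true boundary.
    end_margin: true boundaries closer than this to either terminus are
    not counted for recall (assumed reading of the caption).
    """
    def near(pos, others):
        return any(abs(pos - other) <= tol for other in others)

    correct = sum(1 for p in predicted if near(p, true_boundaries))
    precision = correct / len(predicted) if predicted else 0.0

    # Keep only true boundaries at least `end_margin` residues from both ends.
    scored = [t for t in true_boundaries
              if t - 1 >= end_margin and seq_len - t >= end_margin]
    found = sum(1 for t in scored if near(t, predicted))
    recall = found / len(scored) if scored else 0.0
    return precision, recall

# Toy example with made-up positions (not data from the paper).
print(precision_recall([100, 250], [110, 30, 400], seq_len=500))
```

Sweeping the decision threshold that produces `predicted` and recomputing both values at each step gives the curve described in panel (b).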
Classification of proteins as single or multi-domain
| Overall Acc. | Single Dom. Precision | Single Dom. Recall | Multi-Dom. Precision | Multi-Dom. Recall |
|---|---|---|---|---|
| 0.82 | 0.88 | 0.86 | 0.68 | 0.72 |
Using the results from Task 1, we classified proteins as single or multi-domain. Any protein that generated at least one boundary signal classified as a near/away boundary signal was considered a multi-domain protein.
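The decision rule described here is simple enough to state as code. The sketch below assumes each boundary signal has already been given a Task 1 label; the label strings `"near"`, `"away"` and `"false"` and the function name `classify_protein` are hypothetical, chosen only for illustration.

```python
def classify_protein(signal_labels):
    """Return "multi" if any signal is a near/away boundary signal, else "single".

    A protein that generated no signals at all falls through to "single",
    which follows directly from the rule as stated.
    """
    if any(label in ("near", "away") for label in signal_labels):
        return "multi"
    return "single"

print(classify_protein(["false", "away", "false"]))  # one real boundary signal
print(classify_protein(["false", "false"]))          # only false signals
print(classify_protein([]))                          # no signals generated
```

Because a single positive signal is enough to flip the call, the rule favors multi-domain recall over multi-domain precision, which is consistent with the pattern in the table above.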
Classification of CASP9 targets as single or multi-domain
| Predictor | Accuracy | Single Dom. Precision | Single Dom. Recall | Multi-Dom. Precision | Multi-Dom. Recall |
|---|---|---|---|---|---|
| DOMPro | 0.72 | 0.82 | 0.84 | 0.30 | 0.28 |
| PPRODO | 0.63 | 0.84 | 0.65 | 0.30 | 0.56 |
| DoBo | 0.78 | 0.90 | 0.81 | 0.50 | 0.68 |
Using DOMPro, PPRODO and our method DoBo, we classified all CASP9 targets as single or multi-domain. For PPRODO, predictions were based on the authors' documented procedure for predicting domain number [13]. For DoBo, any target that generated at least one boundary signal classified as a near/away boundary signal was considered to be multi-domain.
Precision and recall of domain boundary predictions on CASP9 continuous, multi-domain targets
| Predictor | Precision of Domain Boundary Prediction | Recall of Domain Boundaries |
|---|---|---|
| DOMPro | 0.50 | 0.14 |
| PPRODO | 0.50 | 0.52 |
| DoBo | 0.49 | 0.70 |
For the 14 continuous, multi-domain targets from CASP9, we used DOMPro, PPRODO and our method DoBo to predict domain boundaries. Only domain boundary predictions which were more than 40 residues from the N- or C-terminal end of a sequence were considered. A domain boundary prediction is considered correct if it occurs within 20 residues of a true domain boundary. The recall value is calculated for domain boundaries which occur at least 40 residues from the N- or C-terminal end of a sequence.
Continuous, multi-domain CASP9 targets and domain definitions
| Target | Domain Definitions |
|---|---|
| T0529 | 7-339, 364-561 |
| T0537 | 65-350, 351-381 |
| T0542 | 2-302, 303-585 * |
| T0548 | 12-46, 47-106 |
| T0550 | 31-117, 178-339 |
| T0553 | 3-65, 66-136 |
| T0571 | 32-196, 197-331 |
| T0575 | 1-63, 64-216 * |
| T0582 | 2-122, 123-221 |
| T0586 | 5-84, 85-123 |
| T0596 | 6-58, 59-188 |
| T0600 | 17-75, 76-122 |
| T0608 | 29-117, 118-278 |
| T0611 | 3-55, 56-213 |
The target numbers and domain definitions used when evaluating domain boundary predictions on the CASP9 dataset. For targets T0542 and T0575 (marked with *), a portion of the domain definition was discontinuous. These discontinuous portions were consolidated into one range.
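For readers who want to work with these definitions programmatically, the following Python sketch parses a definition string from the table into residue ranges and lists the junction between consecutive domains. It makes no claim about how the authors converted a junction into a single boundary position, which the table does not state. The function names `parse_domains` and `junctions` are hypothetical.

```python
def parse_domains(definition):
    """Parse a string such as "2-302, 303-585 *" into [(2, 302), (303, 585)]."""
    text = definition.replace("*", "")  # drop the footnote marker
    domains = []
    for part in text.split(","):
        start, end = part.strip().split("-")
        domains.append((int(start), int(end)))
    return domains

def junctions(domains):
    """Return (last residue of a domain, first residue of the next) pairs."""
    return [(a[1], b[0]) for a, b in zip(domains, domains[1:])]

print(junctions(parse_domains("2-302, 303-585 *")))  # T0542, adjacent domains
print(junctions(parse_domains("31-117, 178-339")))   # T0550, domains separated by a linker
```

For targets such as T0550 the two numbers in a junction differ by more than one residue, so any single boundary position derived from them depends on a convention that would have to be chosen separately.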