Ren-Xiang Yan, Jing-Na Si, Chuan Wang, Ziding Zhang.
Abstract
BACKGROUND: Machine learning-based methods have proven powerful for developing new fold recognition tools. In our previous work [Zhang, Kochhar and Grigorov (2005) Protein Science, 14: 431-444], we established a machine learning-based method called DescFold, which uses Support Vector Machines (SVMs) to combine four descriptors: a profile-sequence-alignment-based descriptor using Psi-blast e-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast e-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. In this work, we focus on improving DescFold by incorporating more powerful descriptors and setting up a user-friendly web server.
Year: 2009 PMID: 20003426 PMCID: PMC2803855 DOI: 10.1186/1471-2105-10-416
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sensitivity of fold recognition based on individual descriptors.
| Descriptors | Sensitivity |
|---|---|
| SSEA | 524/1835 = 28.56% |
| Psi-blast | 676/1835 = 36.84% |
| Rps-blast | 688/1835 = 37.49% |
| Motif | 360/1835 = 19.62% |
| PPA | 1083/1835 = 59.02% |
| PSPA | 1052/1835 = 57.33% |
Sensitivity of DescFold using different descriptors.a
| Descriptors included | Sensitivity |
|---|---|
| SSEA + Psi-blast + Rps-blast | 937/1835 = 51.06% |
| SSEA + Psi-blast + Rps-blast + motif | 1025/1835 = 55.86% |
| SSEA + Psi-blast + Rps-blast + motif + PPA | 1248/1835 = 68.01% |
| SSEA + Psi-blast + Rps-blast + motif + PPA + PSPA | 1322/1835 = 72.04% |
a The evaluation reflects the fold identification performance for all proteins in the SCOP_1.73_1835 dataset. For each protein, only the top hit was taken into account.
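The sensitivity values in the two tables above are simple ratios of correctly identified proteins to the 1,835 proteins in the dataset, counting only the top hit. A minimal sketch of that arithmetic follows; the function names and the dictionary-based inputs are hypothetical and are not taken from the DescFold implementation.

```python
def top_hit_sensitivity(top_hits, true_folds):
    """Count queries whose top-ranked hit has the correct fold.

    top_hits:   dict mapping query id -> fold of the top-ranked hit (hypothetical input)
    true_folds: dict mapping query id -> true fold of the query
    Returns (number correct, number of queries).
    """
    if not true_folds:
        raise ValueError("no query proteins given")
    correct = sum(1 for query, fold in true_folds.items()
                  if top_hits.get(query) == fold)
    return correct, len(true_folds)


def format_sensitivity(correct, total):
    """Format a sensitivity the way the tables do, e.g. '524/1835 = 28.56%'."""
    return f"{correct}/{total} = {100 * correct / total:.2f}%"


# Reproduce one table entry from its counts.
print(format_sensitivity(1322, 1835))
```

Only the counts are needed to reproduce a table entry; the per-query comparison is shown to make explicit that a query counts as correct only if its single best hit shares its fold.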
Figure 1. Performance of fold recognition using different descriptors. True positive instances versus false positive instances were used to examine the number of true positives out of 1,835 proteins identified by varying similarity scores.
Figure 2. Performance of remote homology identification using different descriptors. True positive rates versus false positive rates were used to examine the number of true positives out of 8,244 protein pairs identified by varying similarity scores.
The ROCn scores and the corresponding sensitivity values of DescFold using different descriptors.a
| Descriptors included | ROC16,744 (Sn)b, c | ROC83,720 (Sn)b, c | ROC167,440 (Sn)b, c | AUC |
|---|---|---|---|---|
| SSEA + Psi-blast + Rps-blast | 0.0029 (34.70%) | 0.0209 (52.76%) | 0.0506 (64.56%) | 0.8768 |
| SSEA + Psi-blast + Rps-blast + motif | 0.0032 (37.73%) | 0.0223 (55.25%) | 0.0529 (66.49%) | 0.8831 |
| SSEA + Psi-blast + Rps-blast + motif + PPA | 0.0041 (46.94%) | 0.0256 (60.06%) | 0.0584 (70.21%) | 0.8962 |
| SSEA + Psi-blast + Rps-blast + motif + PPA + PSPA | 0.0050 (70.21%) | 0.0305 (68.51%) | 0.0668 (75.93%) | 0.9143 |
a These measurements reflect the performance of remote homology identification for all protein pairs within the SCOP_1.73_1835 dataset.
b ROC16,744, ROC83,720, and ROC167,440 stand for the ROCn scores at 1%, 5%, and 10% false positive rates, respectively.
c The value inside the parentheses denotes the corresponding sensitivity.
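The ROCn scores above can be read as the area under the ROC curve up to the n-th false positive. The sketch below assumes an un-normalized partial area (so the score cannot exceed the false positive rate cutoff, which matches the magnitude of the reported values); the exact definition used by the authors is not given here, and the function name and inputs are hypothetical. Tied scores are not handled specially.

```python
def roc_n(scores, labels, n):
    """Partial area under the ROC curve up to the n-th false positive.

    scores: similarity scores, higher meaning more similar
    labels: True for a true pair, False otherwise
    n:      number of false positives at which to stop
    Returns (partial area, sensitivity at the n-th false positive).
    Assumption: the area is NOT rescaled to [0, 1].
    """
    total_pos = sum(1 for label in labels if label)
    total_neg = len(labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    if not 0 < n <= total_neg:
        raise ValueError("n must be between 1 and the number of negatives")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    tp_sum = 0  # sum over false positives of the true positives ranked above them
    for _, is_positive in ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            tp_sum += true_pos
            if false_pos == n:
                break
    area = tp_sum / (total_pos * total_neg)
    return area, true_pos / total_pos


# Toy example with three true pairs and three false pairs.
area, sensitivity = roc_n([0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
                          [True, False, True, True, False, False], n=2)
```

With n equal to the total number of negatives, the same function returns the full AUC, which is how the last column of the table relates to the ROCn columns under this reading.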
Figure 3. Cartoon representation of two remote homologs (SCOP entries: d2a13a1 and d1hmsa_) successfully detected by DescFold. The structural alignment between d2a13a1 (red) and d1hmsa_ (green) was carried out by using CE [51]. The RMSD for 121 structurally aligned residues is 3.6 Å, and the CE Z-Score is 5.2.
Figure 4. Snapshot of the DescFold website. (A) The submission page of DescFold. (B) The result page of DescFold.
Figure 5. Performance of DescFold based on the SCOP_1.75_1866 test set. The performance was measured at the fold (A) and superfamily (B) levels, respectively.
Comparison of receiver operator characteristics (≤ 10 false positives) and sensitivity for different fold recognition methods based on all LiveBench-2008.1 targets.a
| Method | ROCb: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sensitivityc: All | Trivial | Easy | Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFASd | 85 | 94 | 119 | 133 | 135 | 139 | 140 | 140 | 140 | 140 | 150 | 8 | 103 | 39 |
| Inubd | 73 | 89 | 106 | 116 | 120 | 121 | 121 | 121 | 121 | 121 | 134 | 6 | 91 | 37 |
| Fugued | 61 | 79 | 81 | 85 | 87 | 96 | 101 | 102 | 104 | 104 | 135 | 8 | 95 | 32 |
| mGenThreaderd | 77 | 89 | 89 | 90 | 90 | 93 | 97 | 97 | 98 | 98 | 143 | 8 | 97 | 38 |
| 3D-PSSMd | 48 | 55 | 72 | 75 | 78 | 80 | 86 | 86 | 87 | 89 | 102 | 5 | 75 | 22 |
| DescFolde | 87 | 89 | 99 | 103 | 104 | 108 | 111 | 114 | 115 | 116 | 134 | 8 | 92 | 34 |
a LiveBench-2008.1 contains 283 targets, divided into 9 trivial, 109 easy and 165 hard targets. As defined by the developer of LiveBench, trivial targets are proteins that share strong sequence similarity with previously known structures, as measured by Blast with an e-value < 0.001. Easy and hard targets are distinguished by whether a structural template can be identified by Psi-blast with an e-value < 0.001.
b 1-10: number of correct predictions with higher reliability than the 1st to 10th false prediction.
c Number of correct predictions for all, trivial, easy and hard targets, respectively.
d The results for FFAS, Inub, Fugue, mGenThreader, and 3D-PSSM were taken from http://meta.bioinfo.pl/results.pl?comp_name=livebench-2008.1
e The performance was evaluated based on the number of correctly assigned folds. We considered two hits similar if the Z-Score obtained by applying the CE structural alignment algorithm [51] was ≥ 4.0.
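The receiver operator characteristic columns count how many correct predictions are ranked above the 1st, 2nd, ..., 10th false prediction, with correctness decided by the CE Z-Score cutoff of 4.0. A minimal sketch of that counting is given below; the function names, the `(reliability, is_correct)` input format, and the padding behaviour when fewer than ten false predictions exist are assumptions made for illustration, not the LiveBench implementation.

```python
CE_Z_CUTOFF = 4.0  # cutoff stated in the table footnote


def is_correct_fold(ce_z_score):
    """A prediction counts as correct if its CE Z-Score reaches the cutoff."""
    return ce_z_score >= CE_Z_CUTOFF


def correct_before_false(predictions, max_false=10):
    """Count correct predictions ranked above the k-th false one, k = 1..max_false.

    predictions: list of (reliability, is_correct) tuples, one per target
                 (hypothetical input format)
    Returns a list of max_false counts. If fewer false predictions exist,
    the remaining entries repeat the total number of correct predictions
    (an assumption of this sketch).
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    counts = []
    correct = 0
    for _, ok in ranked:
        if ok:
            correct += 1
        else:
            counts.append(correct)
            if len(counts) == max_false:
                break
    while len(counts) < max_false:
        counts.append(correct)
    return counts


# Toy example: six targets, two false predictions.
toy = [(9.0, True), (8.0, True), (7.0, False),
       (6.0, True), (5.0, False), (4.0, True)]
toy_counts = correct_before_false(toy, max_false=3)
```

Counted this way, the values in a row can never decrease from left to right, which gives a quick consistency check on any reported row.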
Comparison of receiver operator characteristics (≤ 10 false positives) for different fold recognition methods based on all LiveBench-2008.2 targets.a
| Method | ROCb: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sensitivityc: All | Trivial | Easy | Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFASd | 121 | 174 | 205 | 218 | 228 | 263 | 267 | 269 | 278 | 278 | 302 | 15 | 218 | 69 |
| Inubd | 29 | 34 | 126 | 149 | 183 | 195 | 209 | 210 | 211 | 228 | 257 | 14 | 189 | 54 |
| Fugued | 129 | 186 | 199 | 219 | 221 | 223 | 224 | 225 | 225 | 225 | 285 | 16 | 213 | 56 |
| mGenThreaderd | 179 | 197 | 205 | 211 | 215 | 215 | 216 | 222 | 232 | 232 | 290 | 16 | 215 | 59 |
| 3D-PSSMd | 25 | 75 | 83 | 97 | 127 | 140 | 175 | 176 | 178 | 179 | 220 | 12 | 181 | 27 |
| DescFolde | 158 | 190 | 190 | 211 | 215 | 212 | 215 | 220 | 224 | 224 | 294 | 15 | 210 | 69 |
a LiveBench-2008.2 has a total number of 513 targets, including 16 trivial, 246 easy and 256 hard targets. Please refer to the footnote of Table 4 for the definitions of trivial, easy and hard targets.
b 1-10: number of correct predictions with higher reliability than the 1st to 10th false prediction.
c Number of correct predictions for all, trivial, easy and hard targets, respectively.
d The results for FFAS, Inub, Fugue, mGenThreader, and 3D-PSSM were taken from http://meta.bioinfo.pl/results.pl?comp_name=livebench-2008.2
e The performance was evaluated based on the number of correctly assigned folds. We considered two hits similar if the Z-Score obtained by applying the CE structural alignment algorithm [51] was ≥ 4.0.
The sensitivity of different methods on the Lindahl dataset at the family, superfamily, and fold levels.a, b
| Method | Family level (%): Top 1 | Top 5 | Superfamily level (%): Top 1 | Top 5 | Fold level (%): Top 1 | Top 5 |
|---|---|---|---|---|---|---|
| Psi-blastc | 71.2 | 72.3 | 27.4 | 27.9 | 4.0 | 4.7 |
| Fuguec | 82.2 | 85.8 | 41.9 | 53.2 | 12.5 | 26.8 |
| FOLDproc | 55.0 | 70.0 | 26.5 | 48.3 | ||
| HHpredd | 82.9 | 87.1 | 58.0 | 70.0 | 25.2 | 39.4 |
| Sparksd | 81.6 | 88.1 | 52.5 | 69.1 | 24.3 | 47.7 |
| SP3d | 81.6 | 86.8 | 55.3 | 67.7 | 28.7 | 47.4 |
| SP4d | 80.9 | 86.3 | 57.8 | 68.9 | 30.8 | 53.6 |
| SP5d | 82.4 | 87.6 | 59.8 | 70.0 | 58.7 | |
| DescFold_Ie | 80.7 | 88.5 | 57.8 | 69.1 | 24.9 | 55.8 |
| DescFold_IIf | 81.1 | 88.5 | 32.4 | |||
a The sensitivity is defined by the percentage of query proteins having at least one correct hit ranked first, or within the top 5.
b Values in bold are the best results.
c The results were cited from Ref. [25].
d The results were cited from Ref. [16].
e DescFold_I was based on the SSEA-, Psi-blast-, Rps-blast-, and PPA-based descriptors.
f DescFold_II was based on the SSEA-, Psi-blast-, Rps-blast-, motif-, and PPA-based descriptors.
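Footnote a defines sensitivity as the percentage of query proteins with at least one correct hit ranked first, or within the top 5. A minimal sketch of that definition follows; the function name, the ranked-list input, and the `is_correct` callback (which would encode a family, superfamily, or fold match) are hypothetical.

```python
def top_k_sensitivity(ranked_hits, is_correct, k):
    """Percentage of queries with at least one correct hit among the top k.

    ranked_hits: dict mapping query id -> list of template ids, best first
    is_correct:  function (query, template) -> bool at the chosen level
                 (family, superfamily, or fold)
    k:           1 for the Top 1 columns, 5 for the Top 5 columns
    """
    if not ranked_hits:
        raise ValueError("no query proteins given")
    found = sum(
        1 for query, templates in ranked_hits.items()
        if any(is_correct(query, template) for template in templates[:k])
    )
    return 100.0 * found / len(ranked_hits)


# Toy example: q1 is solved by its top hit, q2 only by its third hit.
toy_hits = {"q1": ["t1", "t2", "t3"], "q2": ["t4", "t5", "t6"]}
toy_truth = {("q1", "t1"), ("q2", "t6")}
top1 = top_k_sensitivity(toy_hits, lambda q, t: (q, t) in toy_truth, k=1)
top5 = top_k_sensitivity(toy_hits, lambda q, t: (q, t) in toy_truth, k=5)
```

Because the top 5 list contains the top hit, the Top 5 value for a method is always at least as large as its Top 1 value at the same level.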
The F-scores of ten input features used in building the SVM models.
| Feature | F-Score |
|---|---|
| | 0.421 |
| | 0.368 |
| | 0.279 |
| | 0.217 |
| | 0.162 |
| | 0.135 |
| | 0.119 |
| | 0.081 |
| | 0.062 |
| | 0.026 |
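The text here does not state which F-score definition was used. The sketch below assumes the feature-ranking F-score commonly used alongside SVMs (the Chen and Lin formulation), which compares how far the positive-class and negative-class means of one feature lie from the overall mean, relative to the within-class variances. The function name and the toy values are hypothetical.

```python
def f_score(pos_values, neg_values):
    """F-score of a single feature (assumed Chen and Lin definition).

    pos_values: values of the feature over the positive examples
    neg_values: values of the feature over the negative examples
    A larger score suggests the feature separates the two classes better.
    """
    n_pos, n_neg = len(pos_values), len(neg_values)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("need at least two examples per class")

    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)

    # Sample variances within each class.
    var_pos = sum((x - mean_pos) ** 2 for x in pos_values) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg_values) / (n_neg - 1)
    if var_pos + var_neg == 0:
        raise ValueError("feature is constant within both classes")

    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    return between / (var_pos + var_neg)


# Toy feature: well separated classes give a high score.
separated = f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The score considers each feature in isolation, so it ranks features by individual discriminative power and does not capture information shared between features.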