Jayashree Ramana, Dinesh Gupta.
Abstract
BACKGROUND: Functional annotation of rapidly amassing nucleotide and protein sequences presents a challenging task for modern bioinformatics. This is particularly true for protein families that share extremely low sequence identity, such as the lipocalins, a family of proteins with varied functions and great diversity at the sequence level, yet conserved structures.
Year: 2009 PMID: 20030857 PMCID: PMC2813246 DOI: 10.1186/1471-2105-10-445
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of SVM classifiers for various combinations of training features, kernels, parameters and validation methods
| Feature | V* | Kernel | Parameters | SN (%) | SP (%) | Acc (%) | MCC | F measure |
|---|---|---|---|---|---|---|---|---|
| AAC | A | R | -0.1, 1, 0.01, - | 72.79 | 77.10 | 75.16 | 0.498 | 1.429 |
| DPC | A | P | -0.1, 0, -, 2 | 80.14 | 87.34 | 84.10 | 0.678 | 1.658 |
| PSSM | A | R | -0.1, 5, 9, - | 89.70 | 89.15 | 89.40 | 0.786 | 1.725 |
| | D | R | -0.1, 5, 9, - | 84.55 | 85.54 | 85.09 | 0.701 | 1.644 |
| | D | R | -0.1, 5, 9, - | 88.96 | 84.33 | 86.42 | 0.731 | 1.633 |
| SSC | A | R | -0.1, 5, 3, - | 86.02 | 86.74 | 86.42 | 0.726 | 1.665 |
| | D | R | -0.1, 5, 3, - | 84.55 | 86.74 | 85.75 | 0.712 | 1.664 |
| | D | R | -0.1, 5, 3, - | 82.35 | 78.91 | 80.46 | 0.609 | 1.509 |
| DPC+SSC | A | P | 0.1, 0, -, 2 | 85.29 | 86.14 | 85.76 | 0.713 | 1.651 |
| PSSM+SSC | A | R | 0.0, 4, 1, - | 88.97 | 92.16 | 90.72 | 0.812 | 1.785 |
| | A | R | -0.1, 4, 1, - | 89.70 | 89.15 | 89.40 | 0.786 | 1.725 |
| | D | R | 0.0, 4, 1, - | 87.49 | 80.72 | 83.77 | 0.678 | 1.561 |
| | D | R | 0.0, 4, 1, - | 85.29 | 84.93 | 85.09 | 0.700 | 1.628 |
| DPC+PSSM | A | R | -0.1, 0, 0.001, - | 81.61 | 83.73 | 82.78 | 0.652 | 1.592 |
| DPC+PSSM+SSC | A | P | 0.1, 0, -, 2 | 85.29 | 86.14 | 85.76 | 0.713 | 1.651 |
*Validation: A = Leave-one-out; D = Hold-out
Kernel: R = RBF; P = Polynomial; L = Linear
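As a rough illustration of how the SN, SP, Acc and MCC columns relate to a confusion matrix, the sketch below uses the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient. The counts passed in are an assumption: they are not stated in this record, and were chosen only because they sum to 302 sequences and land close to the AAC row. The function name is hypothetical.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return SN, SP, Acc (in percent) and MCC from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)                      # sensitivity
    sp = 100.0 * tn / (tn + fp)                      # specificity
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Assumed counts (not given in the source): 136 positives, 166 negatives.
sn, sp, acc, mcc = classification_metrics(tp=99, fn=37, tn=128, fp=38)
print(f"SN={sn:.2f}  SP={sp:.2f}  Acc={acc:.2f}  MCC={mcc:.3f}")
```

With these assumed counts the values fall within rounding distance of the AAC row, which suggests, but does not confirm, that the table was computed this way.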
Figure 1. ROC curves of the different SVM classifiers. ROC plot of SVMs based on different protein sequence features, depicting the relative trade-off between true positives and false positives. The diagonal line (line of no discrimination) represents a completely random guess. The corresponding area under the curve (AUC) is given in brackets in the legend.
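For readers unfamiliar with how such a curve and its AUC are obtained, here is a minimal sketch: the decision threshold is swept over the sorted classifier scores, each step yields a (false positive rate, true positive rate) point, and the area is integrated with the trapezoid rule. The scores and labels are made-up toy values, not data from the paper, and both function names are hypothetical. The sketch assumes at least one positive and one negative example.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data (hypothetical): 1 = lipocalin, 0 = non-lipocalin.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(roc_points(toy_scores, toy_labels))
print(round(toy_auc, 3))
```

A classifier that gives every sequence the same score produces only the points (0, 0) and (1, 1), that is, the diagonal, with an AUC of 0.5.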
Quality estimation of SVM models over random prediction
| Model | Correctly predicted (total 302) | S (%) |
|---|---|---|
| AAC | 227 | 49.869 |
| DPC | 254 | 67.767 |
| PSSM | 270 | 78.653 |
| SSC | 261 | 72.632 |
| DPC-SSC | 259 | 71.297 |
| PSSM-SSC | 274 | 81.247 |
| DPC-PSSM | 243 | 60.772 |
| DPC-PSSM-SSC | 259 | 71.297 |
The table estimates the quality of the model generated from each module relative to random prediction (S).
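This record does not state how S is computed. One plausible reading, offered here as an assumption, is the normalized percentage better than random, S = 100 × (P − R) / (T − R), where P is the number of correct predictions, T the total, and R the number of correct predictions expected by chance from the confusion-matrix marginals. The confusion-matrix counts below are likewise assumed (136 positives and 166 negatives, inferred from the reported percentages), and the function name is hypothetical.

```python
def percent_better_than_random(tp, fn, tn, fp):
    """Assumed definition: S = 100 * (P - R) / (T - R)."""
    total = tp + fn + tn + fp
    correct = tp + tn
    # Correct predictions expected by chance, given the row and column totals.
    random_correct = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total
    return 100.0 * (correct - random_correct) / (total - random_correct)

# Assumed counts, consistent with 227 and 270 correct out of 302.
s_aac = percent_better_than_random(tp=99, fn=37, tn=128, fp=38)
s_pssm = percent_better_than_random(tp=122, fn=14, tn=148, fp=18)
print(f"AAC: {s_aac:.3f}  PSSM: {s_pssm:.3f}")
```

Under these assumptions the results agree with the AAC and PSSM rows to within rounding, which supports this reading without proving it.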
Performance on independent datasets
| Model tested | Positive (total 42) | Negative-FABPs (total 25) | Negative-Triabins (total 28) |
|---|---|---|---|
| SSC | 34 | 17 | 28 |
| PSSM | 39 | 25 | 18 |
| PSSM-SSC | 38 | 21 | 28 |
The three best models were used for testing on independent datasets. The numbers show the correctly predicted sequences out of the total given in the first row.
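To make the counts easier to compare across datasets of different sizes, the short sketch below converts them to percentages. The numbers are copied from the table; the dictionary layout and function name are illustrative choices only.

```python
# Dataset sizes and correct predictions, copied from the table above.
totals = {"positive": 42, "fabp": 25, "triabin": 28}
correct = {
    "SSC": {"positive": 34, "fabp": 17, "triabin": 28},
    "PSSM": {"positive": 39, "fabp": 25, "triabin": 18},
    "PSSM-SSC": {"positive": 38, "fabp": 21, "triabin": 28},
}

def percent_correct(correct, totals):
    """Percentage of correctly predicted sequences per model and dataset."""
    return {
        model: {name: round(100.0 * n / totals[name], 1)
                for name, n in counts.items()}
        for model, counts in correct.items()
    }

rates = percent_correct(correct, totals)
for model, row in rates.items():
    print(model, row)
```

Read this way, PSSM alone is strongest on the positive and FABP sets but weakest on triabins, while the hybrid PSSM-SSC model is the most balanced of the three.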
Figure 2. Snapshot of LipocalinPred web server sample output. The web server predicts lipocalins using the two best classifiers: the PSSM profile-based classifier and the hybrid PSSM-SSC classifier. The two classifiers may be selected together for a comparative prediction. The server accepts FASTA-formatted sequences and allows user-defined prediction thresholds ranging from -1.5 to 1.5.
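The effect of a user-defined threshold can be sketched as follows. This is not the server's code: the function name, the default threshold of 0.0, the rule that a score at or above the threshold counts as a lipocalin, and the example scores are all assumptions made for illustration. Only the allowed range of -1.5 to 1.5 comes from the caption.

```python
def predict_lipocalin(svm_score, threshold=0.0):
    """Label a sequence from its SVM decision score (assumed rule: score >= threshold)."""
    if not -1.5 <= threshold <= 1.5:
        raise ValueError("threshold must lie between -1.5 and 1.5")
    return "lipocalin" if svm_score >= threshold else "non-lipocalin"

# Hypothetical decision scores for one query sequence from the two classifiers.
scores = {"PSSM": 0.42, "PSSM-SSC": -0.15}

for threshold in (-0.5, 0.0, 0.5):
    calls = {name: predict_lipocalin(s, threshold) for name, s in scores.items()}
    print(threshold, calls)
```

Raising the threshold makes a positive call stricter, trading sensitivity for specificity; running both classifiers side by side shows where they disagree at a given setting.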