Nora C Toussaint, Christian Widmer, Oliver Kohlbacher, Gunnar Rätsch.
Abstract
BACKGROUND: String kernels are commonly used for the classification of biological sequences, both nucleotide and amino acid sequences. Although string kernels are already very powerful, they have a major shortcoming when applied to amino acid sequences: they ignore the physico-chemical properties of the amino acids being compared, such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is scarce. So far, only very few approaches have aimed at combining these two ideas.
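The shortcoming described above can be made concrete with a small sketch. It uses a plain spectrum kernel (shared k-mer counts) as a stand-in for string kernels in general, and an RBF similarity on a single property (Kyte-Doolittle hydropathy) as a stand-in for property-aware comparison. The peptides, the choice of `k = 3`, the function names, and the RBF form are illustrative assumptions, not the construction used in the paper.

```python
import math
from collections import Counter


def spectrum_kernel(x, y, k=3):
    """Dot product of the k-mer count vectors of two sequences."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[m] * cy[m] for m in cx)


# Kyte-Doolittle hydropathy values, only for the residues compared below.
HYDROPATHY = {"I": 4.5, "L": 3.8, "D": -3.5}


def rbf_similarity(a, b, sigma=1.0):
    """RBF similarity of two residues on one property (illustrative form)."""
    d = HYDROPATHY[a] - HYDROPATHY[b]
    return math.exp(-d * d / (2 * sigma ** 2))


reference = "SLYNTVATL"     # arbitrary 9-mer peptide
conservative = "SIYNTVATL"  # L -> I: similar hydrophobicity
radical = "SDYNTVATL"       # L -> D: hydrophobic -> charged

# The string kernel cannot tell the two substitutions apart ...
print(spectrum_kernel(reference, conservative),
      spectrum_kernel(reference, radical))
# ... whereas a property-based similarity can.
print(rbf_similarity("L", "I"), rbf_similarity("L", "D"))
```

Both variants share the same number of 3-mers with the reference, so the spectrum kernel scores them identically, while the property-based similarity separates the conservative from the radical substitution.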
Year: 2010 PMID: 21034432 PMCID: PMC2966294 DOI: 10.1186/1471-2105-11-S8-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performances of kernels utilizing sequential structure and/or AA properties on three MHC alleles
| Kernel | A*2301 auROC | A*2301 SD | B*5801 auROC | B*5801 SD | A*0201 auROC | A*0201 SD |
|---|---|---|---|---|---|---|
| WD | 0.7307 | (0.0900) | 0.9314 | (0.0279) | 0.9485 | (0.0076) |
| Poly ( |  | (0.0808) | 0.9428 | (0.0336) | 0.9354 | (0.0111) |
| Poly ( | 0.7964 | (0.0727) | 0.8778 | (0.0637) | 0.9052 | (0.0070) |
| Poly ( | 0.8220 | (0.0442) | 0.4948 | (0.0560) | 0.4729 | (0.0246) |
| RBF ( | 0.8277 | (0.0904) | 0.9396 | (0.0303) | 0.9345 | (0.0114) |
| RBF ( | 0.7847 | (0.0787) | 0.9235 | (0.0347) | 0.9157 | (0.0072) |
| RBF ( | 0.8204 | (0.0864) | 0.9509 | (0.0317) |  | (0.0072) |
| WD-Poly ( | 0.7879* | (0.0858) | 0.9406* | (0.0319) | 0.9495* | (0.0084) |
| WD-Poly ( | 0.7983* | (0.0902) | 0.9499* | (0.0348) | 0.9483 | (0.0073) |
| WD-Poly ( | 0.8307* | (0.1077) | 0.9491* | (0.0224) | 0.9490* | (0.0070) |
| WD-RBF ( | 0.8133* | (0.0806) |  | (0.0265) | 0.9486* | (0.0051) |
| WD-RBF ( | 0.7782* | (0.1222) | 0.9487* | (0.0434) | 0.9500* | (0.0074) |
| WD-RBF ( |  | (0.0993) |  | (0.0265) |  | (0.0067) |
auROCs and standard deviations (in parentheses) were determined in two times nested 5-fold cross-validation. Best (bold) and second-best (underlined) performances per MHC allele are highlighted. An asterisk marks a performance improvement due to the proposed modifications.
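A minimal sketch of nested 5-fold cross-validation, as one reading of the evaluation protocol above: the inner loop selects a hyperparameter using only the outer training part, and the outer loop reports the mean and standard deviation of the held-out scores. The names `k_folds`, `nested_cv`, `param_grid`, and the `evaluate` callback are hypothetical; the paper's exact splitting and model-selection details are not given here.

```python
import random
from statistics import mean, stdev


def k_folds(indices, k):
    """Split a list of indices into k nearly equal folds."""
    return [indices[i::k] for i in range(k)]


def nested_cv(n_samples, param_grid, evaluate, k=5, seed=0):
    """Nested k-fold cross-validation.

    evaluate(train_idx, test_idx, param) must train a model with `param`
    on train_idx and return its score (e.g. auROC) on test_idx.
    Returns (mean, standard deviation) of the k outer-fold scores.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    outer_scores = []
    for outer_test in k_folds(idx, k):
        held_out = set(outer_test)
        outer_train = [i for i in idx if i not in held_out]

        def inner_score(param):
            # Model selection sees only the outer training part.
            scores = []
            for inner_val in k_folds(outer_train, k):
                val = set(inner_val)
                inner_train = [i for i in outer_train if i not in val]
                scores.append(evaluate(inner_train, inner_val, param))
            return mean(scores)

        best_param = max(param_grid, key=inner_score)
        outer_scores.append(evaluate(outer_train, outer_test, best_param))
    return mean(outer_scores), stdev(outer_scores)
```

Keeping model selection inside the outer training part is what prevents the reported score from being optimistically biased by the hyperparameter search.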
Figure 1. Learning curve analysis on MHC allele A*0201. Shown are areas under the ROC curve, averaged over 100 different test splits (30% of the data), for increasing numbers of training examples (up to 70% of the data). The training part was used for training and model selection using 5-fold cross-validation.
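The splitting scheme behind such a learning curve can be sketched as follows, assuming that each of the 100 repetitions draws a fresh random 30% test set and that training subsets of increasing size are taken from the remaining 70%. The generator name `learning_curve_splits` and its parameters are hypothetical, and training fractions are interpreted as fractions of the whole data set.

```python
import random


def learning_curve_splits(n_samples, train_fractions, n_repeats=100,
                          test_fraction=0.3, seed=0):
    """Yield (repeat, fraction, train_idx, test_idx) for a learning curve."""
    rng = random.Random(seed)
    n_test = round(test_fraction * n_samples)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Fixed test set per repetition; the rest is the training pool.
        test_idx, pool = idx[:n_test], idx[n_test:]
        for frac in train_fractions:
            n_train = min(len(pool), round(frac * n_samples))
            yield rep, frac, pool[:n_train], test_idx
```

A caller would train and select a model on each `train_idx` (for example with 5-fold cross-validation), score it on `test_idx`, and average the scores per fraction over all repetitions.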
Figure 2. Performance of WD and WD-RBF ( The pie chart displays the number of alleles for which the WD (green) and the WD-RBF (red) performed best, respectively, and the number of alleles for which they performed equally (blue).
Comparison of kernels for l-mer content with their AA-property enhanced counterparts.
| Method | auROC50 | #Wins |
|---|---|---|
| Spectrum ( | 15.2% | 7/54 |
| Spectrum-RBF ( | 42.1% | 45/54 |
| Mismatch ( | 42.3% | 13/54 |
| Mismatch-RBF ( | 43.6% | 36/54 |
| Profile ( | 82.1% | 3/54 |
| Profile-RBF ( | 82.2% | 10/54 |
Comparison of the three kernels proposed in [11,21,22] with their AA-property enhanced counterparts for remote homology detection of 54 protein families. auROC50 is the average auROC50 score and #Wins is the number of families for which each method outperforms its counterpart (Spectrum vs. Spectrum-RBF, Mismatch vs. Mismatch-RBF, Profile vs. Profile-RBF). The kernels taking advantage of AA properties lead to a higher average accuracy in all three cases (p-values: 6.92 × 10⁻⁸ for spectrum, 0.0045 for mismatch, and 1.0 for profile kernels). For l and τ we use the published parameter settings. For σ we chose the best result among σ ∈ {0.1, 1, 10, 100, 1000}.
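The two summary statistics in this table can be sketched as follows. The sketch assumes the common definition of the ROC-n score (area under the ROC curve up to the first n false positives, normalized to lie between 0 and 1) and counts a win only on a strictly higher score, so that ties count for neither method. The function names, the tie handling within a ranking, and the fallback when fewer than n negatives exist are assumptions.

```python
def roc_n(labels, scores, n=50):
    """ROC-n score: normalized area under the ROC curve up to n false positives.

    labels are 1 (positive) or 0 (negative); higher scores rank first.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, y in ranked if y == 1)
    total_neg = len(ranked) - total_pos
    n_eff = min(n, total_neg)  # assumption: cap n at the number of negatives
    if total_pos == 0 or n_eff == 0:
        raise ValueError("need at least one positive and one negative")
    tp_seen = fp_seen = area = 0
    for _, y in ranked:
        if y == 1:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # true positives ranked above this false positive
            if fp_seen == n_eff:
                break
    return area / (n_eff * total_pos)


def count_wins(scores_a, scores_b):
    """Per-family wins for two methods; ties count for neither."""
    wins_a = sum(1 for fam in scores_a if scores_a[fam] > scores_b[fam])
    wins_b = sum(1 for fam in scores_a if scores_b[fam] > scores_a[fam])
    return wins_a, wins_b
```

Counting only strict wins is consistent with the table, where the two win counts of a pair do not always add up to 54.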