Thomas Lingner, Peter Meinicke.
Abstract
BACKGROUND: Classification of protein sequences is a central problem in computational biology. Among current computational methods, discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for the analysis of discriminative sequence features, and predictions on new sequences are usually computationally expensive.
Year: 2008 PMID: 18522726 PMCID: PMC2438326 DOI: 10.1186/1471-2105-9-259
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Word score profiles for positive test sequences of SCOP superfamily 7.3.5. Word score profiles of the first 5 positive test sequences associated with experiment 1 (SCOP superfamily 7.3.5: omega toxin-like) using word length K = 6. Amino acid sequences are mapped to the x-axis, while the y-axis corresponds to discriminative word scores. Word score values are centered at position 4 of the overlapping words. See Equation (12) in section "Discriminant function in feature space" for details about the calculation of word scores.
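The mapping from overlapping words to sequence positions described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `word_score_profile`, the dictionary `toy_scores`, and the toy sequence are hypothetical, and the scores are placeholder values rather than results of Equation (12). Words missing from the dictionary are assumed to score zero.

```python
def word_score_profile(sequence, word_scores, k=6, center=4):
    """Return one score per sequence position.

    Each overlapping word of length k places its score at the position
    of its `center`-th residue (1-based within the word). Positions not
    covered by any word centre keep the value 0.0.
    """
    profile = [0.0] * len(sequence)
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        # Assumption: words without a known score contribute 0.0.
        profile[start + center - 1] = word_scores.get(word, 0.0)
    return profile


# Toy input: placeholder scores, not values from the paper.
toy_scores = {"CCSGSC": 2.0, "CSGSCN": 1.0}
profile = word_score_profile("ACCSGSCNA", toy_scores)
```

With K = 6 and centering at position 4, the first three and the last two sequence positions never receive a score, which is why a profile is shorter than the sequence by K - 1 informative positions.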
Overview of detection performance for several methods.
| Method | avg. ROC | avg. ROC50 | avg. mRFP | avg. # SV |
| WCM (K = 1) | 0.8705 | 0.3153 | 0.1065 | 1798 |
| WCM (K = 2) | 0.8926 | 0.3814 | 0.0833 | 1673 |
| WCM (K = 3) | 0.8964 | 0.4040 | 0.0813 | 1628 |
| WCM (K = 4) | 0.9013 | 0.4257 | 0.0801 | 1604 |
| WCM (K = 5) | 0.9032 | 0.4413 | 0.0795 | 1591 |
| WCM (K = 6) | 0.9044 | 0.4473 | 0.0778 | 1591 |
| WCM (K = 7) | 0.9036 | 0.4454 | 0.0785 | 1600 |
| WCM (K = 8) | 0.9024 | 0.4470 | 0.0801 | 1607 |
| WCM (K = 9) | 0.9018 | 0.4516 | 0.0815 | 1614 |
| WCM (K = 10) | 0.9012 | 0.4528 | 0.0830 | 1620 |
| LA-eig | 0.9348 | 0.6614 | 0.0489 | 2640 |
| ODH Monomer | 0.9135 | 0.4554 | 0.0729 | 1601 |
| SVM pairwise | 0.9008 | 0.3986 | 0.0810 | 2355 |
| Mismatch (5,1) | 0.8852 | 0.3815 | 0.0949 | 2943 |
| Spectrum (3) | 0.8239 | 0.2939 | 0.1535 | 2350 |
| Spectrum {1,2} | 0.8919 | 0.3913 | 0.0798 | 1560 |
| Spectrum {1,2,3} | 0.8957 | 0.4094 | 0.0766 | 1711 |
| Spectrum {1,2,3,4} | 0.8981 | 0.4180 | 0.0769 | 1882 |
Performance evaluation results of the word correlation approach (WCM) using word lengths K = 1, ..., 10 in comparison to the local alignment kernel (LA-eig) [10], Monomer Distance Histograms (ODH Monomer) [14], SVM pairwise [6], the Mismatch string kernel [8], the Spectrum kernel [9] and the combination of Spectrum kernels for different word lengths (see section "Results").
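The ROC50 column can be read as the area under the ROC curve up to the first 50 false positives, scaled to the range [0, 1]. The sketch below follows that common definition; it is an assumption that the evaluation above uses exactly this variant. The function name `roc_n` and the toy data are hypothetical, and tied scores are not given any special treatment.

```python
def roc_n(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives.

    scores: classifier outputs, higher means "more likely positive".
    labels: truthy for positive examples, falsy for negative examples.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, label in ranked if label)
    true_pos = 0
    false_pos = 0
    area = 0
    for _, is_positive in ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
            # Each false positive adds the number of positives ranked above it.
            area += true_pos
            if false_pos == n:
                break
    if total_pos == 0 or false_pos == 0:
        raise ValueError("need at least one positive and one negative example")
    # With fewer than n negatives this reduces to the full ROC area.
    return area / (false_pos * total_pos)


toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
full_area = roc_n(toy_scores, toy_labels)        # fewer than 50 negatives
truncated = roc_n(toy_scores, toy_labels, n=1)   # stop at first false positive
```

Because the truncated score only rewards positives ranked ahead of the earliest false positives, it is typically much lower than the full ROC area, which matches the gap between the two columns in the table.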
Figure 2. Discriminant of SCOP superfamily 7.3.5 in the WCM space. Word correlation matrix representation of the discriminant weight vector of superfamily 7.3.5 (omega toxin-like) after training using K = 6 (see text). Rows and columns correspond to occurrences of amino acids at two particular word positions for the first and second occurrence, respectively. Red (blue) matrix elements represent large positive (negative) discriminant weight values according to the color bar on the right-hand side.
Ordered list of discriminative words for experiment 1.
| # | Score | Word | Count |
| 1 | 7.066 | CCSGSC | 3 |
| 2 | 6.930 | CCSRKC | 2 |
| 3 | 6.419 | CRSGKC | 4 |
| 4 | 5.451 | CCRSCN | 2 |
| 5 | 5.354 | GRSGKC | 1 |
| 6 | 5.215 | CSRKCN | 2 |
| 7 | 5.142 | GRGSRC | 1 |
| 8 | 4.979 | CSGRGS | 1 |
| 9 | 4.812 | CCTGSC | 4 |
| 10 | 4.789 | SYNCCR | 2 |
List of the 10 most discriminative words for positive training sequences of experiment 1 according to SCOP superfamily 7.3.5 using word length K = 6. Words are sorted according to their word score. The first and second columns correspond to the rank and score of a word, respectively. The third column contains the word as an amino acid sequence in IUPAC one-letter code. The fourth column shows the number of occurrences of a particular word in the positive training sequences.
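A table of this shape can be assembled from word scores and the positive training sequences roughly as follows. This is a sketch under stated assumptions: the helper names `count_words` and `top_words`, the toy sequences, and the toy scores are hypothetical, and the scores are taken as given rather than computed from the discriminant.

```python
from collections import Counter


def count_words(sequences, k):
    """Count every overlapping word of length k across all sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def top_words(word_scores, sequences, k=6, n=10):
    """Return (rank, score, word, count) rows sorted by descending score."""
    counts = count_words(sequences, k)
    ranked = sorted(word_scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, score, word, counts[word])
        for rank, (word, score) in enumerate(ranked[:n], start=1)
    ]


# Toy input: placeholder sequences and scores, not data from the paper.
toy_sequences = ["CCSGSCA", "ACCSGSC"]
toy_scores = {"CSGSCA": 1.0, "CCSGSC": 2.0}
rows = top_words(toy_scores, toy_sequences)
```

Counting total occurrences, rather than the number of sequences that contain a word, follows the wording of the caption.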
Ordered list of discriminative features.
| # | Family 1.27.1.1 | Family 1.27.1.2 | Family 1.36.1.2 | Family 1.36.1.5 |
| 1 | Leu@5, Leu@5 | Leu@6, Leu@6 | Thr@1, Val@5 | Ala@1, Lys@5 |
| 2 | Leu@6, Leu@6 | Leu@5, Leu@5 | Thr@2, Val@6 | Ala@2, Lys@6 |
| 3 | Leu@1, Leu@1 | Leu@1, Leu@1 | Val@1, Ser@2 | asp@2, asp@2 |
| 4 | Leu@2, Leu@2 | Leu@2, Leu@2 | Val@2, Ser@3 | asp@3, asp@3 |
| 5 | Leu@4, Leu@4 | Leu@4, Leu@4 | Val@5, Ser@6 | asp@1, asp@1 |
| 6 | Leu@3, Leu@3 | Leu@3, Leu@3 | Val@4, Ser@5 | asp@4, asp@4 |
| 7 | Leu@1, Leu@5 | Leu@1, Leu@5 | Val@3, Ser@4 | asp@6, asp@6 |
| 8 | Leu@2, Leu@6 | Leu@2, Leu@6 | Val@2, Thr@6 | asp@5, asp@5 |
| 9 | Glu@6, Glu@6 | Glu@1, Glu@1 | Val@1, Thr@5 | Ala@1, Leu@2 |
| 10 | gly@1, gly@1 | Glu@2, Glu@2 | Ser@1, Thr@4 | Ala@2, Leu@3 |
List of the 10 most discriminative features for four superfamilies associated with the SCOP class "All alpha proteins". Features are sorted in descending order according to their absolute discriminative weight (not shown). The first column corresponds to the rank of a feature, and the succeeding columns contain the description of the feature in the word correlation feature space in terms of a pair of amino acids (in IUPAC three-letter code) at particular word positions. Features that are associated with negative discriminative weights are printed with lowercase first letters.
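The notation used in the table (for example "Leu@5, Leu@5" or "asp@2, asp@2") can be produced from a weight vector as sketched below. The representation of a feature as a pair of (amino acid, position) tuples, the function names `format_feature` and `rank_features`, and the toy weights are assumptions made for illustration; the actual indexing of the word correlation feature space is not reproduced here.

```python
# Standard IUPAC one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def format_feature(feature, weight):
    """Render ((aa1, pos1), (aa2, pos2)) as e.g. 'Leu@5, Leu@5'.

    Features with a negative weight get a lowercase first letter.
    """
    (aa1, pos1), (aa2, pos2) = feature
    names = [THREE_LETTER[aa1], THREE_LETTER[aa2]]
    if weight < 0:
        names = [name.lower() for name in names]
    return f"{names[0]}@{pos1}, {names[1]}@{pos2}"


def rank_features(weights, n=10):
    """Sort features by descending absolute weight and format the top n."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return [format_feature(feature, weight) for feature, weight in ranked[:n]]


# Toy weights: placeholder values, not taken from the paper.
toy_weights = {
    (("A", 1), ("K", 5)): 0.3,
    (("D", 2), ("D", 2)): -0.7,
    (("L", 5), ("L", 5)): 0.9,
}
top = rank_features(toy_weights, n=2)
```

Sorting by absolute value keeps strongly negative features in the list, which is why lowercase entries such as "asp@2, asp@2" appear alongside positive ones.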
Figure 3. Comparison of ROC and ROC50 performance for Spectrum method and WCM method. The figure shows the mean ROC and ROC50 performance over 54 experiments for the Spectrum method and the word correlation method (WCM) using word length K = 1, ..., 6.