Martin Madera, Ryan Calmus, Grant Thiltgen, Kevin Karplus, Julian Gough.
Abstract
MOTIVATION: Some first-order methods for protein sequence analysis inherently treat each position as independent. We develop a general framework for introducing longer-range interactions. We then demonstrate the power of our approach by applying it to secondary structure prediction; under the independence assumption, sequences produced by existing methods can contain features that are not protein-like, an extreme example being a helix of length 1. Our goal was to make the predictions from state-of-the-art methods more realistic, without loss of performance by other measures.
Year: 2010 PMID: 20130034 PMCID: PMC2828123 DOI: 10.1093/bioinformatics/btq020
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The STR2 alphabet. This 13-state alphabet uses DSSP hydrogen bond definitions and is defined strictly from DSSP output. The main difference is that STR2 subdivides the DSSP class E (β-sheet) into seven classes: A, M and P, anti-parallel, mixed or parallel β-strand, hydrogen bonded to two partners; Y and Z, anti-parallel edge strand residue, bonded and non-bonded, respectively; Q, parallel edge strand, both bonded and non-bonded residues; and E, all other β-sheet residues, typically β-bulges. STR2 groups together DSSP classes H (α-helix) and I (π-helix) into a single STR2 class H. The remaining five classes are identical to DSSP: G, 3₁₀ helix; T, turn; S, bend; C, coil; and B, β-bridge.
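The relationship between STR2 and DSSP described in this caption can be sketched as a lookup table. This is a minimal illustration built only from the caption text; the names `STR2_TO_DSSP` and `str2_to_dssp` are hypothetical, and mapping STR2 H back to DSSP H (rather than I) is an arbitrary convention, since STR2 merges the two.

```python
# Collapse STR2 labels onto the DSSP classes named in the Fig. 1 caption.
STRAND_SUBCLASSES = "AMPYZQE"  # the seven STR2 classes that subdivide DSSP E
SHARED_CLASSES = "GTSCB"       # classes the caption describes as identical to DSSP

STR2_TO_DSSP = {state: "E" for state in STRAND_SUBCLASSES}
STR2_TO_DSSP["H"] = "H"        # STR2 H covers DSSP H and I; H is chosen by convention
STR2_TO_DSSP.update({state: state for state in SHARED_CLASSES})


def str2_to_dssp(labels: str) -> str:
    """Translate a string of STR2 labels into the coarser DSSP labels."""
    return "".join(STR2_TO_DSSP[label] for label in labels)


collapsed = str2_to_dssp("CAMPYZQEHGTSB")
```

The dictionary has exactly 13 keys, one per STR2 state, and an unknown label raises `KeyError` instead of being silently passed through.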
Fig. 2. Improvement due to k-mer model. (A) The improvements can be seen by comparing the two blocks of secondary structure sequences: above are the results from sampling columns independently and below are results from correlated sampling using the k-mer model (10). The STR2 profile is shown graphically above the alignments and the true secondary structure at the bottom; (B) uses the same colouring scheme to show the elements on the PDB structure (1aba). N.B. The quality of individual rows is important, not the alignment.
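To see why independent column sampling can yield a helix of length 1, consider a toy sketch. The three-column profile and the function names below are hypothetical, and the paper's correlated k-mer sampling is not reproduced here; the code only shows that, under independence, an isolated H flanked by C receives the plain product of the column probabilities.

```python
import math
import random

# Hypothetical toy profile: one dict of state probabilities per column.
TOY_PROFILE = [
    {"C": 0.6, "H": 0.4},
    {"C": 0.6, "H": 0.4},
    {"C": 0.6, "H": 0.4},
]


def independent_probability(seq, profile):
    """Probability of seq when every column is treated as independent."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return math.prod(col.get(state, 0.0) for state, col in zip(seq, profile))


def sample_independent(profile, rng):
    """Draw one state per column without looking at neighbouring columns."""
    return "".join(
        rng.choices(list(col), weights=list(col.values()), k=1)[0]
        for col in profile
    )


rng = random.Random(0)  # seeded so the draw is repeatable
sample = sample_independent(TOY_PROFILE, rng)
lone_helix = independent_probability("CHC", TOY_PROFILE)  # 0.6 * 0.4 * 0.6
```

Nothing in this scheme penalises the single-residue helix "CHC"; a model with longer-range terms is needed to make such samples rare.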
The accuracy of predictions as measured by standard performance measures: SOV on 3-states, Q3, Q13
| Input | Method | SOV (%) | Q3 (%) | Q13 (%) |
|---|---|---|---|---|
| Alignment | Profile only | 81.2 ± 0.2 | 77.3 ± 0.2 | |
| Alignment | Exact Viterbi | 79.3 ± 0.2 | 75.3 ± 0.2 | 53.4 ± 0.3 |
| Alignment | Exact post. | 80.5 ± 0.2 | 76.2 ± 0.2 | 54.2 ± 0.3 |
| Alignment | Sampled post. | | | 55.2 ± 0.3 |
| Single sequence | Profile only | 71.3 ± 0.2 | 65.8 ± 0.2 | |
| Single sequence | Exact Viterbi | 72.4 ± 0.2 | 64.5 ± 0.2 | 43.1 ± 0.3 |
| Single sequence | Exact post. | 73.8 ± 0.2 | 65.4 ± 0.2 | 43.3 ± 0.2 |
| Single sequence | Sampled post. | | | 44.2 ± 0.2 |
The highest accuracy in each column is shown in bold, and the standard error of the mean is shown after each number.
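As an illustration of how a Q-type accuracy and its standard error of the mean can be computed, here is a minimal sketch. It assumes that accuracy is calculated per sequence and then averaged over the test set, which the text above does not state; the function names are hypothetical, and SOV is not implemented.

```python
import math
import statistics


def q_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


one_sequence = q_accuracy("HHEC", "HHCC")   # 3 of 4 residues agree
mean, sem = mean_and_sem([70.0, 80.0])
```

The same `q_accuracy` function covers Q3 and Q13; only the alphabet of the input strings differs.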
Quality of predictions
| Alignment | 0.271 | |
| Single sequence | 0.189 |
X is the profile and M is the joint profile + k-mer model. The probabilities are reported per residue; that is, the quantity shown is $\left[\prod_{y} P(y)\right]^{1/L}$, where the product is over all real sequences y in the test set and L is the sum of their lengths.
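The per-residue quantity described in this note is a geometric mean over all residues in the test set, and it is safest to compute it in log space to avoid underflow. A minimal sketch, assuming the log-probability of each real sequence under the model is already available (the function name and inputs are hypothetical):

```python
import math


def per_residue_probability(log_probs, lengths):
    """Geometric mean probability per residue.

    log_probs: natural-log probability of each real sequence under the model.
    lengths:   length of each of those sequences, in the same order.
    """
    if len(log_probs) != len(lengths):
        raise ValueError("need one length per sequence")
    total_length = sum(lengths)  # L, the sum of all sequence lengths
    return math.exp(sum(log_probs) / total_length)


# Two toy sequences in which every residue has probability 0.5.
toy = per_residue_probability(
    [4 * math.log(0.5), 6 * math.log(0.5)],
    [4, 6],
)
```

Summing logs and dividing by L is equivalent to taking the L-th root of the product of the sequence probabilities.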
Fig. 3. Distribution of scores for samples from profile and corresponding joint model. Each dot represents a sequence. The axes are the two components of the joint model M. The red cloud (bottom) represents 50 000 samples from the profile X; the blue cloud (top) represents 50 000 samples from the joint model M; the circle marks the real sequence. The profile used is the same as in Figure 2.
The five most encouraged and discouraged k-mers for multiple alignments
| Encouraged k-mer | Mean score | Discouraged k-mer | Mean score |
|---|---|---|---|
| MM | 3.4 | MA | −13.0 |
| EE | 3.0 | PM | −11.7 |
| ZE | 2.2 | HY | −11.4 |
| GG | 2.1 | GY | −11.2 |
| YZ | 1.9 | TP | −11.1 |
| PMS | 5.7 | CTZ | −9.1 |
| HQE | 5.1 | QEZ | −9.0 |
| TMA | 4.8 | YTC | −8.9 |
| YQY | 4.7 | CTC | −8.3 |
| ZQZ | 4.1 | ZEQ | −8.3 |
| YEQY | 7.6 | CGGC | −7.5 |
| HQBB | 7.2 | CGGS | −7.0 |
| QEZM | 6.9 | CGGT | −7.0 |
| QBBQ | 6.8 | CHHT | −6.6 |
| BTQM | 6.6 | CGGH | −6.3 |
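
To make the table easier to read against a concrete prediction, the sketch below lists which tabulated k-mers occur in a STR2 string. It copies only a handful of rows from the table, the names are hypothetical, and it is a lookup aid only: it does not reproduce the scoring function of the k-mer model.

```python
# Mean scores copied from a few rows of the table above (excerpt, not the full model).
TABLE_SCORES = {
    "MM": 3.4, "EE": 3.0, "GG": 2.1, "MA": -13.0,
    "PMS": 5.7, "CTC": -8.3,
    "YEQY": 7.6, "CGGC": -7.5,
}


def kmers(seq: str, k: int):
    """All overlapping windows of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tabulated_kmers(seq: str, scores=TABLE_SCORES):
    """(k-mer, mean score) pairs for every window of seq found in the excerpt."""
    hits = []
    for k in sorted({len(key) for key in scores}):
        for window in kmers(seq, k):
            if window in scores:
                hits.append((window, scores[window]))
    return hits


short_helix = tabulated_kmers("CGGC")
strand_pair = tabulated_kmers("MMA")
```

For "CGGC" the lookup returns both the encouraged 2-mer GG and the discouraged 4-mer CGGC, which illustrates how longer k-mers can discourage a two-residue 3₁₀ helix even though GG on its own is favoured.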