Fredrik Johansson, Hiroyuki Toh.
Abstract
BACKGROUND: Conservation and variation scores are used to evaluate sites in a multiple sequence alignment in order to identify residues critical for structure or function. A variety of scores are available today, but it is not clear how the different scores relate to each other.
Year: 2010 PMID: 20663120 PMCID: PMC2920274 DOI: 10.1186/1471-2105-11-388
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Pearson correlation of alignment size (N) and mean conservation score on the CSA dataset.
| Correlation | Score | Correlation | Score |
|---|---|---|---|
| 0.97 | Lockless99 | -0.42 | Pei01varw |
| 0.51 | Mayrose04 | -0.50 | Shannon |
| 0.02 | Sander91sp | -0.55 | Caffrey04w |
| -0.08 | Valdar01 | -0.55 | Wang06w |
| -0.11 | Karlin96 | -0.56 | Shannonw |
| -0.18 | Thompson97 | -0.64 | Mihalek07 |
| -0.20 | Liu08w | -0.66 | Capra07w |
| -0.26 | Pei01sp | -0.71 | Zvelebil87 |
| -0.31 | Pei01spw | -0.74 | Wu70 |
| -0.33 | Mirny99 | -0.75 | Taylor86 |
| -0.36 | Pei01var | -0.92 | Zhang08 |
| -0.39 | Williamson95 | -0.92 | Mihalek04 |
Variation scores have been negated into conservation scores.
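The quantity tabulated above can be sketched as follows: compute the mean score over the sites of each alignment, negate it if the score measures variation, and correlate the means with the alignment sizes N. This is a minimal illustration, not the authors' code; the `(n_sequences, site_scores)` input layout, the function names, and the toy numbers are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def size_correlation(alignments, is_variation=False):
    """Correlate alignment size with mean per-site score.

    alignments: list of (n_sequences, site_scores) pairs (hypothetical layout).
    is_variation: if True, negate the mean so it reads as conservation.
    """
    sizes, means = [], []
    for n_seqs, site_scores in alignments:
        mean = sum(site_scores) / len(site_scores)
        sizes.append(n_seqs)
        means.append(-mean if is_variation else mean)
    return pearson(sizes, means)

# Toy data (invented): larger alignments have lower mean scores.
toy = [(10, [0.9, 0.8, 0.7]), (50, [0.6, 0.7, 0.5]), (168, [0.3, 0.4, 0.2])]
r_conservation = size_correlation(toy)
r_variation = size_correlation(toy, is_variation=True)
```

Negating a variation score only flips the sign of the correlation, which is why the table can list conservation and variation scores on a common scale.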
Figure 1. Hierarchical clustering results. Dendrogram obtained from hierarchical clustering of average Spearman correlations on each alignment, using average linkage. Each node is labeled with a probability value in percent, given by 1000 iterations of a bootstrap procedure.
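The clustering described in the caption can be illustrated with a small pure-Python sketch: Spearman correlation between the per-site values of two scores, converted to a distance, followed by average-linkage agglomeration. The distance `1 - rho`, the score names, and the toy values are assumptions made for the example; the averaging over alignments and the bootstrap step are omitted.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    dist maps frozenset({a, b}) to the distance between items a and b.
    Returns the merges as (cluster_1, cluster_2, distance) tuples.
    """
    def between(c1, c2):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        d, i, j = min(
            (between(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy per-site values for three hypothetical scores on one alignment.
site_scores = {
    "score_a1": [1, 2, 3, 4, 5],
    "score_a2": [1, 3, 2, 4, 5],
    "score_b1": [5, 4, 3, 2, 1],
}
names = list(site_scores)
dist = {
    frozenset((a, b)): 1.0 - spearman(site_scores[a], site_scores[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
merges = average_linkage(names, dist)
```

In this toy case the two similar scores merge first, and the anticorrelated score joins last at the mean of its distances to both.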
Figure 2. Performance evaluation of catalytic site prediction. Lines show performance, measured by AUC, for subsets of the original dataset obtained by setting an upper limit on alignment size (ranging from 10 to 168 sequences in an alignment). The legend is sorted by score performance on the original dataset (equal to the right terminal values of the graph) and also shows the numerical AUC value for this case. Lines and score names are colored according to the clustering shown in Figure 1: red, cluster A; blue, cluster B; green, other scores.
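As a rough illustration of the evaluation in this caption, the sketch below computes AUC as the probability that a catalytic site receives a higher conservation score than a non-catalytic site, restricted to alignments at or below a size limit. Pooling sites across alignments before computing AUC is an assumption; the source does not state how per-alignment results were combined. The data layout and the toy values are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    has the higher score; ties count as half a win.
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative site")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_for_size_limit(alignments, max_size):
    """AUC over all sites of alignments with at most max_size sequences.

    alignments: list of (n_sequences, site_scores, is_catalytic) triples
    (hypothetical layout). Sites are pooled across alignments (assumption).
    """
    scores, labels = [], []
    for n_seqs, site_scores, is_catalytic in alignments:
        if n_seqs <= max_size:
            scores.extend(site_scores)
            labels.extend(is_catalytic)
    return auc(scores, labels)

# Toy data (invented): two alignments of different sizes.
toy = [
    (10, [0.9, 0.2, 0.4], [True, False, False]),
    (100, [0.3, 0.5, 0.8, 0.6], [False, True, False, False]),
]
small_only = auc_for_size_limit(toy, 10)
all_alignments = auc_for_size_limit(toy, 168)
```

Sweeping `max_size` from 10 to 168 and plotting the result for each score would give curves of the kind the figure describes.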
Example of ranking of a site
| Score | Cluster | Normalized rank | Score | Cluster | Normalized rank |
|---|---|---|---|---|---|
| Mirny99 | | 1.00 | Caffrey04w | B | 0.36 |
| Thompson97 | A | 0.71 | Zhang08 | B | 0.23 |
| Zvelebil87 | | 0.68 | Mayrose04 | B | 0.20 |
| Mihalek07 | | 0.65 | Shannonw | B | 0.16 |
| Liu08w | A | 0.65 | Pei01varw | B | 0.16 |
| Sander91sp | A | 0.63 | Wu70 | B | 0.15 |
| Pei01sp | A | 0.63 | Shannon | B | 0.15 |
| Pei01spw | A | 0.63 | Pei01var | B | 0.13 |
| Karlin96 | A | 0.63 | Wang06w | B | 0.11 |
| Taylor86 | | 0.58 | Lockless99 | B | 0.10 |
| Williamson95 | | 0.52 | Capra07w | B | 0.09 |
| Valdar01 | A | 0.49 | Mihalek04 | B | 0.08 |
Normalized rankings of alignment site with index 188 in alignment 1k32_A. The alignment site has the amino acid profile V: 41%, L: 22%, M: 22%, I: 15%.
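To make the idea of a normalized ranking concrete, here is a minimal sketch under one plausible definition, which the source does not confirm: the rank of a site is the number of other sites with a lower conservation score (ties count half), divided by the number of other sites, so 1.00 means the most conserved site in the alignment and 0.00 the least. The function name and the toy scores are assumptions.

```python
def normalized_rank(site_scores, index):
    """Normalized rank in [0, 1] of one site among all sites of an alignment.

    Assumed definition: (sites scoring lower + half of tied sites) divided by
    the number of other sites. Higher means more conserved.
    """
    if len(site_scores) < 2:
        raise ValueError("need at least two sites to rank")
    target = site_scores[index]
    below = sum(s < target for s in site_scores)
    ties = sum(s == target for s in site_scores) - 1  # exclude the site itself
    return (below + 0.5 * ties) / (len(site_scores) - 1)

# Toy conservation scores for a four-site alignment (invented values).
scores = [0.1, 0.5, 0.9, 0.5]
top = normalized_rank(scores, 2)
bottom = normalized_rank(scores, 0)
tied = normalized_rank(scores, 1)
```

Because the result depends only on the ordering of sites, it lets scores with very different numeric ranges be compared at the same site, as in the table above.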
Scoring methods at an alignment site k
| Category | Availability | Score | Formula |
|---|---|---|---|
| Symbol frequency | S | Wu70 | |
| | I | Lockless99 | |
| | S | Pei01var(w) | |
| Stereochemical properties | I | Taylor86 | min\|{ |
| | I | Zvelebil87 | 0.9 - 0.1 |
| Symbol entropy | S | Sander91 | |
| | S | Shenkin91 | |
| | S | Gerstein95 | |
| | I | Wang06w | |
| | I | Capra07w | ( |
| Stereochemically sensitive entropy | S | Mirny99 | |
| | S | Williamson95 | |
| | S | Caffrey04w | |
| Substitution matrix | S | Sander91sp | |
| | I | Karlin96 | |
| | S | Valdar01 | |
| | S | Pei01sp(w) | |
| | I | Thompson97 | |
| | I | Mihalek07 | |
| | I | Liu08w | |
| Phylogeny | S | Mihalek04 | |
| | I | Zhang08 | |
| | D | Mayrose04 | Rate4Site |
Availability labels: S, available in the SEALA package at http://github.com/fredrikj; I, implemented for this study and available at http://github.com/fredrikj/bioruby; D, downloaded from the Rate4Site website http://consurf.tau.ac.il. Scores ending with "w" use the sequence weighting of Henikoff and Henikoff [9]. Scores ending with "(w)" are used both with and without weights. Explanations of the notation are given in Table 4, and further details can be found in the main text.
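The effect of the "w" suffix can be illustrated with a short sketch of position-based sequence weights in the style of Henikoff and Henikoff, applied to a Shannon entropy score. This is not the SEALA or bioruby implementation: the function names are hypothetical, gaps are treated as ordinary symbols, and entropy is reported in bits without any of the normalizations an individual score may apply.

```python
import math
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    alignment: list of equal-length strings. At each column a sequence
    receives 1 / (k * n), where k is the number of distinct symbols in the
    column and n is the number of sequences sharing its symbol.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]

def shannon_entropy(column, weights=None):
    """Shannon entropy (bits) of a column; uniform weights if none are given."""
    if weights is None:
        weights = [1.0] * len(column)
    freq = {}
    for symbol, w in zip(column, weights):
        freq[symbol] = freq.get(symbol, 0.0) + w
    total = sum(freq.values())
    return -sum((f / total) * math.log2(f / total) for f in freq.values() if f > 0)

# Toy alignment (invented): three similar sequences and one outlier.
alignment = ["VLA", "VLA", "VIC", "MIC"]
weights = henikoff_weights(alignment)
first_column = [seq[0] for seq in alignment]
unweighted = shannon_entropy(first_column)
weighted = shannon_entropy(first_column, weights)

# Entropy of the site profile V: 41%, L: 22%, M: 22%, I: 15%.
profile_entropy = shannon_entropy(list("VLMI"), [0.41, 0.22, 0.22, 0.15])
```

The outlier sequence receives the largest weight, so its residue contributes more to the weighted column frequencies and the weighted entropy of the first column is higher than the unweighted one. For the profile V: 41%, L: 22%, M: 22%, I: 15%, the entropy comes out at roughly 1.9 bits.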
Notations used in Table 3
| No. of sequences in alignment. | |
|---|---|
| The amino acid in sequence | |
| Sequence distance in percent. | |
| Probability estimated from site | |
| Probability estimated from alignment. | |
| Probability estimated from database. | |
| - | |
| No. of occurrences in site | |
| Average no. of occurrences in a site. | |
| Most common amino acid at | |
| No. of different amino acids at | |
| The BLOSUM62 matrix, containing log-odds ratios ( | |
| The BLOSUM62 matrix of frequencies ( | |