Bernd W Brandt, K Anton Feenstra, Jaap Heringa.
Abstract
Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein-protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help find targets for experimental analysis. Here, we present multi-Harmony, an interactive web server for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/programs/shmrwww.
Year: 2010 PMID: 20525785 PMCID: PMC2896201 DOI: 10.1093/nar/gkq415
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
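To make the multi-group SH score concrete, the following minimal sketch scores each column of the hypothetical three-group alignment given in this record. The formula is an assumption, since this record does not state it: SH is taken here as one minus the Jensen-Shannon divergence between the per-group residue distributions (equal group weights), normalised by log2 of the number of groups. This formulation happens to reproduce the SH row of the hypothetical alignment, but it should not be read as the authors' implementation. The names `GROUPS`, `entropy`, `column_freqs` and `harmony` are illustrative.

```python
from collections import Counter
from math import log2

# Hypothetical three-group alignment, copied from the illustrative table.
GROUPS = {
    "Group 1": ["RELAAKKA", "RELAFKKI", "REAAYRKL", "REAAFRKM"],
    "Group 2": ["HNVAYRKK", "HNVFYRKK", "HNSAFKKK"],
    "Group 3": ["HSFFYRKQ", "HSMFFRKR", "HSMFYKKS"],
}

def entropy(freqs):
    """Shannon entropy in bits of a residue -> probability mapping."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)

def column_freqs(seqs, pos):
    """Residue frequencies of one alignment column within one group."""
    counts = Counter(seq[pos] for seq in seqs)
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()}

def harmony(groups, pos):
    """Assumed SH: 1 - normalised Jensen-Shannon divergence between groups."""
    profiles = [column_freqs(seqs, pos) for seqs in groups.values()]
    k = len(profiles)
    mixture = Counter()
    for profile in profiles:
        for res, p in profile.items():
            mixture[res] += p / k  # equal weight per group (assumption)
    jsd = entropy(mixture) - sum(entropy(p) for p in profiles) / k
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - jsd / log2(k))

sh_scores = [round(harmony(GROUPS, pos), 2) for pos in range(8)]
print(sh_scores)  # [0.42, 0.0, 0.0, 0.57, 0.87, 0.99, 1.0, 0.0]
```

Under this assumed definition a score of 0 means the groups share no residue types at that position, and 1 means their residue distributions are identical.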
Table 1. Hypothetical alignment of three sub-families to illustrate the SH scores (range from 0 to 1) and mR weights (range from −1 to 1)
Columns Pos 1 to Pos 8 are the alignment positions; columns G1-1 to G3-3 form the distance matrix (group number, then sequence number within the group).

| Sequence | Pos 1 | Pos 2 | Pos 3 | Pos 4 | Pos 5 | Pos 6 | Pos 7 | Pos 8 | G1-1 | G1-2 | G1-3 | G1-4 | G2-1 | G2-2 | G2-3 | G3-1 | G3-2 | G3-3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Group 1 | | | | | | | | | | | | | | | | | | |
| seq1 | R | E | L | A | A | K | K | A | – | ***2*** | 4 | 4 | 6 | 7 | **5** | 7 | 7 | **6** |
| seq2 | R | E | L | A | F | K | K | I | ***2*** | – | 4 | 3 | 6 | 7 | **4** | 7 | **6** | **6** |
| seq3 | R | E | A | A | Y | R | K | L | 4 | 4 | – | ***2*** | **4** | 5 | 6 | **5** | 6 | 6 |
| seq4 | R | E | A | A | F | R | K | M | 4 | 3 | ***2*** | – | **5** | 6 | **5** | 6 | **5** | 7 |
| Group 2 | | | | | | | | | | | | | | | | | | |
| seq1 | H | N | V | A | Y | R | K | K | 6 | 6 | **4** | 5 | – | ***1*** | 3 | **4** | 5 | 5 |
| seq2 | H | N | V | F | Y | R | K | K | 7 | 7 | **5** | 6 | ***1*** | – | 4 | **3** | 4 | 4 |
| seq3 | H | N | S | A | F | K | K | K | 5 | **4** | 6 | 5 | ***3*** | 4 | – | 6 | **5** | **5** |
| Group 3 | | | | | | | | | | | | | | | | | | |
| seq1 | H | S | F | F | Y | R | K | Q | 7 | 7 | **5** | 6 | 4 | **3** | 6 | – | ***3*** | ***3*** |
| seq2 | H | S | M | F | F | R | K | R | 7 | 6 | 6 | **5** | 5 | **4** | 5 | ***3*** | – | ***3*** |
| seq3 | H | S | M | F | Y | K | K | S | **6** | **6** | **6** | 7 | 5 | **4** | 5 | ***3*** | ***3*** | – |
| SH | 0.42 | 0.00 | 0.00 | 0.57 | 0.87 | 0.99 | 1.00 | 0.00 | | | | | | | | | | |
| mR | 1.00 | 1.00 | 0.67 | 1.00 | −0.42 | −0.19 | 0.00 | 0.50 | | | | | | | | | | |
The distance matrix is used by mR to find ‘nearest hits’ (within group; in bold italic) and ‘nearest misses’ (between groups; in bold) for each sequence.
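The lookup of nearest hits and nearest misses can be sketched as follows. This is a minimal illustration, not the mR implementation: it assumes the distance is a plain mismatch count (Hamming distance), which is consistent with the distances printed for the hypothetical alignment, and it assumes one set of nearest misses per other group, with ties kept. The names `ALIGNMENT`, `hamming`, `distance_matrix` and `nearest_neighbours` are illustrative. The diagonal is stored as 0 rather than shown as a dash.

```python
# Hypothetical alignment from the table: (group label, sequence) pairs.
ALIGNMENT = [
    (1, "RELAAKKA"), (1, "RELAFKKI"), (1, "REAAYRKL"), (1, "REAAFRKM"),
    (2, "HNVAYRKK"), (2, "HNVFYRKK"), (2, "HNSAFKKK"),
    (3, "HSFFYRKQ"), (3, "HSMFFRKR"), (3, "HSMFYKKS"),
]

def hamming(a, b):
    """Number of alignment positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(alignment):
    """Full symmetric matrix of pairwise mismatch counts."""
    seqs = [seq for _, seq in alignment]
    return [[hamming(a, b) for b in seqs] for a in seqs]

def nearest_neighbours(alignment, dist, i):
    """Return (nearest hits, {other group: nearest misses}) for sequence i."""
    own_group = alignment[i][0]
    by_group = {}
    for j, (group, _) in enumerate(alignment):
        if j != i:  # a sequence is never its own neighbour
            by_group.setdefault(group, []).append(j)

    def closest(indices):
        best = min(dist[i][j] for j in indices)
        return [j for j in indices if dist[i][j] == best]  # keep ties

    hits = closest(by_group[own_group])
    misses = {g: closest(idx) for g, idx in by_group.items() if g != own_group}
    return hits, misses

dist = distance_matrix(ALIGNMENT)
hits, misses = nearest_neighbours(ALIGNMENT, dist, 0)
print(dist[0])  # [0, 2, 4, 4, 6, 7, 5, 7, 7, 6]
print(hits)     # [1]
print(misses)   # {2: [6], 3: [9]}
```

For the first sequence of Group 1, the nearest hit is the second sequence of the same group (distance 2), and the nearest misses are the third sequences of Group 2 (distance 5) and Group 3 (distance 6). Indices are zero-based.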
Table 2. Properties of our seven data sets used for benchmark comparison of the algorithms
| Data set | Number of classes | Average (SD) class size | Max, min class size | Number of sites | Site information | PDB ref | ‘True’ sites |
|---|---|---|---|---|---|---|---|
| GPCR | 77 | 26.8 (34) | 189, 3 | 214 | ligand | 1GZM | T94, T97, E113, G114, A117, T118, G121, L125, C167, L172, F203, V204, M207, F208, H211, Y268, A269, A272, A292, F293, K296 |
| GPCR-190 | 39 | 4.9 (3.8) | 21, 2 | like ‘GPCR’ | |||
| LacI | 15 | 3.6 (2.5) | 12, 2 | 339 | ligand and DNA | 1EFA | T5, L6, S16, Y17, Q18, R22, N25, Q26, H29, Q54, A57, S61, L73, A75, P76, I79, N125, P127, D149, S191, S193, W220, N246, Q248, Y273, D274, T276, F293 |
| Ras/Ral | 2 | 44.5 (24.5) | 69, 20 | 218 | protein | 5P21 | I24, Q25, D30, E31, D33, I36, E37, Q43, L53, M67, Q70, D92 |
| Rab5/Rab6 | 2 | 5.0 (1.0) | 6, 4 | 163 | protein | 1R2Q | K42, G43, Q44, H46, E47, F48, Q49, E50, S51, H83, A86, M88, Y90, G92, A93, Q94, E117, L118, Q119, R120, Q121, A122, S123, P124, N125, I126, V127, K183 |
| AQP/GLP | 2 | 30.0 (18.0) | 48, 12 | 430 | protein | 1FX8 | L21, W48, V52, A65, H66, L67, V71, T137, Y138, P139, N140, P141, L159, I163, I187, G195, P196, L197, G199, F200, A201, M202 |
| Smad | 2 | 10.0 (2.0) | 12, 8 | 211 | protein | 1KHX | L263, Q264, T267, Q284, Q294, P295, L297, T298, S308, E309, A323, V325, M327, I341, F346, P360, Q364, R365, Y366, W368, N381, R427, T430, S460, V461, R462, C463, M466 |
Data sets are the G-protein coupled receptors (GPCR) and a smaller version (GPCR-190), the LacI family of transcription factors, the Ras super-family of small GTPases (Rab5 versus Rab6; Ras versus Ral), the aquaporins versus glycerol porins (AQP/GLP) and the Smad family of transcription factors [more details in (5,8)].
Figure 1. Validation results for the SH and mR methods. ProteinKeys, PROUST-II, SDPpred v.2 and Xdet are shown for comparison. Results obtained by the different methods were averaged over all data sets weighted by the number of positives. (A) Box plots showing the distribution (as minimum, lower quartile, median, upper quartile and maximum) of ranks of positive sites. Lower is better. (B) Precision/recall (PR) curves showing the relative performance of the methods at different coverage (recall). Higher is better.
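The two quantities plotted in the figure, ranks of positive sites and precision/recall points, can be computed from a per-site score list as in the sketch below. Everything here is illustrative: the site names and scores are made up, higher scores are assumed to mean "more likely a specificity site" (a score where lower is better would be negated first), and the step-wise area summation is one common convention, not necessarily the one used for the published AUC values.

```python
def positive_ranks(scores, positives):
    """Ranks (1 = best) of the true sites when sorting by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [rank for rank, site in enumerate(ranked, start=1) if site in positives]

def precision_recall(scores, positives):
    """(recall, precision) points, one per true site met while walking the ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    points, true_pos = [], 0
    for k, site in enumerate(ranked, start=1):
        if site in positives:
            true_pos += 1
            points.append((true_pos / len(positives), true_pos / k))
    return points

def pr_auc(points):
    """Area under the PR points by step-wise summation (average precision)."""
    area, prev_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

# Made-up scores for five sites; sites "A" and "C" are the assumed true sites.
scores = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6, "E": 0.5}
positives = {"A", "C"}

ranks = positive_ranks(scores, positives)
points = precision_recall(scores, positives)
auc = pr_auc(points)
print(ranks)   # [1, 3]
print(points)  # recall 0.5 at precision 1.0, then recall 1.0 at precision 2/3
```

Low ranks for the true sites and a large area under the PR points both indicate that a method places known specificity sites near the top of its ranking.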
Table 3. Validation of the detection of specificity sites by SH and mR, scored as area under the curve (AUC) of the PR plots against gold-standard specificity sites in 22 data sets: the 7 sets defined in Table 2 and 15 sets obtained from Chakrabarti and Panchenko (15)
| Dataset | cbm9 | cd00120 | cd00264 | cd00333 | cd00363 | cd00365 | cd00423 | cd00985 | CN-myc | GPCR | GPCR-190 | GST | IDH/IMDH | LacI | MDH/LDH | AQP/GLP | nucl cycl. | rab5/6 | ras/ral | ricin | serine | Smad | Weighted average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # positives | 7 | 3 | 3 | 12 | 6 | 10 | 4 | 3 | 11 | 21 | 21 | 9 | 14 | 28 | 1 | 23 | 2 | 28 | 12 | 21 | 2 | 29 | |
| mR | 0.161 | 0.058 | 0.006 | 0.301 | 0.010 | 0.055 | 0.204 | 0.329 | 0.037 | 0.246 | 0.347 | 0.156 | 0.050 | 0.266 | 0.063 | 0.213 | 0.540 | 0.186 | 0.078 | 0.719 | 0.310 | ||
| mR Z | 0.161 | 0.058 | 0.006 | 0.301 | 0.010 | 0.055 | 0.204 | 0.329 | 0.037 | 0.252 | 0.347 | 0.156 | 0.050 | 0.282 | 0.063 | 0.216 | 0.539 | 0.186 | 0.078 | 0.721 | 0.312 | ||
| SH | 0.074 | 0.054 | 0.003 | 0.287 | 0.008 | 0.119 | 0.080 | 0.198 | 0.067 | 0.486 | 0.489 | 0.242 | 0.048 | 0.124 | 0.125 | 0.249 | 0.413 | 0.540 | 0.194 | 0.261 | 0.713 | 0.330 | |
| SH Z | 0.074 | 0.054 | 0.003 | 0.287 | 0.008 | 0.119 | 0.080 | 0.198 | 0.067 | 0.489 | 0.242 | 0.207 | 0.125 | 0.413 | 0.540 | 0.194 | 0.261 | 0.703 | |||||
| ProteinKeys | 0.049 | 0.008 | 0.203 | 0.010 | 0.010 | 0.002 | 0.034 | 0.027 | 0.377 | 0.505 | 0.483 | 0.065 | 0.005 | 0.011 | 0.364 | 0.092 | 0.006 | 0.287 | |||||
| PROUST-II | 0.349 | 0.079 | 0.012 | 0.055 | 0.011 | 0.016 | 0.049 | 0.058 | 0.122 | 0.308 | | 0.446 | 0.089 | 0.111 | 0.015 | 0.187 | 0.305 | 0.455 | 0.378 | 0.256 | 0.723 | 0.258 | |
| SDPpred v.2 | 0.122 | 0.017 | 0.126 | 0.509 | 0.508 | 0.146 | 0.242 | 0.413 | 0.416 | 0.357 | 0.201 | 0.542 | 0.522 | 0.333 | |||||||||
| Xdet | 0.106 | 0.080 | 0.366 | 0.011 | 0.103 | 0.196 | 0.387 | 0.086 | 0.125 | | 0.117 | 0.100 | 0.190 | 0.033 | 0.169 | 0.054 | 0.350 | 0.398 | 0.173 | 0.105 | 0.688 | 0.234 | |
| Xdet sup | 0.209 | 0.106 | 0.019 | 0.346 | 0.012 | 0.101 | 0.275 | | 0.402 | 0.129 | 0.207 | 0.208 | 0.292 | 0.346 | 0.545 | 0.193 | 0.677 | 0.279 | |||||
| Average | 0.172 | 0.072 | 0.026 | 0.280 | 0.010 | 0.088 | 0.136 | 0.286 | 0.078 | 0.344 | 0.448 | 0.318 | 0.086 | 0.204 | 0.103 | 0.208 | 0.304 | 0.468 | 0.465 | 0.206 | 0.314 | 0.691 | 0.298 |
a Nucleotidyl cyclase (data set 'nucl cycl.').
b The GPCR data set is above the maximum of 1000 sequences for these methods.
c Supervised by using subgroupings (row 'Xdet sup').
A higher AUC corresponds to better performance. For comparison, predictions by ProteinKeys, PROUST-II, SDPpred v.2 and Xdet are also shown. Best-scoring methods for each data set are in bold. The final column lists the average AUC per method, weighted by the number of positives, and the bottom row lists the averages per data set.
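The weighting described for the final column can be sketched as below. The data set names and AUC values are placeholders, not values from the table, and skipping data sets for which a method produced no result is an assumption about how missing entries are handled.

```python
def weighted_average_auc(auc_by_dataset, positives_by_dataset):
    """Average AUC over data sets, weighted by each set's number of positives.

    Data sets with no AUC (None) are skipped; this is an assumption.
    """
    scored = [name for name, auc in auc_by_dataset.items() if auc is not None]
    total_positives = sum(positives_by_dataset[name] for name in scored)
    weighted_sum = sum(
        auc_by_dataset[name] * positives_by_dataset[name] for name in scored
    )
    return weighted_sum / total_positives

# Placeholder values for three hypothetical data sets.
auc_by_dataset = {"set_a": 0.2, "set_b": 0.5, "set_c": None}
positives_by_dataset = {"set_a": 10, "set_b": 30, "set_c": 5}

avg = weighted_average_auc(auc_by_dataset, positives_by_dataset)
print(round(avg, 3))  # 0.425
```

Weighting by the number of positives means that data sets with many known specificity sites contribute more to a method's overall score than data sets with only one or two.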
Figure 2. An example of the multi-Harmony output. (A) The main output table, sorted by SH score and filtered on SH score (0.5) and high mR weight (0.8). Only ALA278 at position 17 in the alignment is not a confirmed functional residue. The columns with arrows can be sorted. Most of these columns can also be filtered to display only those alignment positions that satisfy the user-supplied thresholds. (B) The output view in Jalview. Groups are outlined in the alignment and filtered positions (from the output table) are marked in the annotation track ‘Filtered 1’ with a tooltip detailing the filter like ‘Positions passing criteria [score 0.5; weight 0.8] are indicated’. (C) View of the 3D context using Jmol with the protein coloured by mR weights, and filtered residues (from the output table) labelled and highlighted as space-filling spheres. Colouring by SH scores is also possible.
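The filtering and sorting applied to the output table can be mimicked offline, as in this sketch. It uses the SH scores and mR weights of the hypothetical three-group alignment, not the data set shown in the figure. The direction of the cut-offs is an assumption: positions are kept when the SH score is at most 0.5 and the mR weight is at least 0.8, and the result is sorted by ascending SH score. The function name `filter_positions` is illustrative and is not part of the web server.

```python
# SH scores and mR weights per alignment position, copied from the
# hypothetical alignment table (not from the data set shown in the figure).
SH = [0.42, 0.00, 0.00, 0.57, 0.87, 0.99, 1.00, 0.00]
MR = [1.00, 1.00, 0.67, 1.00, -0.42, -0.19, 0.00, 0.50]

def filter_positions(sh, mr, sh_max=0.5, mr_min=0.8):
    """Keep positions with SH <= sh_max and mR >= mr_min, sorted by SH score."""
    rows = [
        (pos, score, weight)
        for pos, (score, weight) in enumerate(zip(sh, mr), start=1)
        if score <= sh_max and weight >= mr_min
    ]
    return sorted(rows, key=lambda row: row[1])

selected = filter_positions(SH, MR)
print(selected)  # [(2, 0.0, 1.0), (1, 0.42, 1.0)]
```

With these thresholds only positions 1 and 2 pass: positions 3 and 8 have an SH score of 0 but an mR weight below 0.8, and position 4 has a high mR weight but an SH score above 0.5.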