Shirley Wu, Tianyun Liu, Russ B Altman.
Abstract
BACKGROUND: The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs.
Year: 2010 PMID: 20122268 PMCID: PMC2833161 DOI: 10.1186/1472-6807-10-4
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Figure 1. Overview of functional site discovery approach. Starting from thousands of protein microenvironments, we use k-means clustering to group them into coarse clusters. Each coarse cluster is then hierarchically clustered, and optimal clusters are identified using a scoring function that incorporates knowledge from scientific literature. These clusters are annotated using information from literature, Swiss-Prot records, and PDB HETATM data to produce novel individual site annotations and potentially novel functional motifs.
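The two-stage clustering described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors are plain coordinate tuples, the function names (`kmeans`, `average_linkage`, `two_stage`) are hypothetical, and a fixed distance `cutoff` stands in for the literature-based scoring function that the paper uses to select optimal clusters.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm; returns k lists of point indices (coarse clusters)."""
    rng = random.Random(seed)
    centers = [points[i] for i in rng.sample(range(len(points)), k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for idx, p in enumerate(points):
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(idx)
        new_centers = []
        for c, members in enumerate(groups):
            if members:
                dims = zip(*(points[i] for i in members))
                new_centers.append(tuple(sum(d) / len(members) for d in dims))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return groups


def average_linkage(points, members, cutoff):
    """Agglomerative clustering of `members`; merging stops once the closest
    pair of clusters is farther apart than `cutoff` (an assumed stand-in for
    the paper's scoring function)."""
    clusters = [[i] for i in members]

    def linkage(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        best, x, y = min(pairs)
        if best > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def two_stage(points, k, cutoff, seed=0):
    """Coarse k-means pass, then hierarchical clustering inside each coarse cluster."""
    fine = []
    for members in kmeans(points, k, seed=seed):
        fine.extend(average_linkage(points, members, cutoff))
    return fine


# Toy data: two well-separated groups, each holding two tight pairs.
points = [(0, 0), (0.1, 0), (1, 0), (1.1, 0),
          (10, 10), (10.1, 10), (11, 10), (11.1, 10)]
subclusters = two_stage(points, k=2, cutoff=0.5)
```

Running the coarse pass first keeps the pairwise-distance computation of the hierarchical step confined to one coarse cluster at a time, which is the practical reason for combining the two methods on thousands of microenvironments.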
Table 1. Test clusters for evaluating functional coherence
| PROSITE pattern | min # proteins | max # proteins |
|---|---|---|
| COPPER_BLUE | 6 | 61 |
| PROTEIN_KINASE_ST | 9 | 1303 |
| ADH_SHORT | 10 | 262 |
| 4FE4S_FERREDOXIN | 11 | 169 |
| TRYPSIN_SER | 13 | 399 |
| EF_HAND | 19 | 1248 |
We tested the functional coherence measure using clusters associated with six functional motifs from PROSITE. The minimum number of proteins is the smallest cluster size we used for that particular PROSITE pattern, derived from training sets from existing FEATURE models. The maximum number is the total number of proteins in Swiss-Prot annotated to that pattern.
Figure 2. Functional coherence of random, functional, and dilute functional clusters. a) We show median functional coherence scores for random clusters, as well as clusters derived from functional site patterns. "PROSITE min" refers to the minimum cluster size for each PROSITE pattern cluster in Table 1 (derived from training sets used for existing FEATURE models [16]), while "PROSITE max" refers to the maximum size of each cluster. The PROSITE subsets were randomly sampled from the max PROSITE clusters, while the random clusters were randomly sampled from all Swiss-Prot proteins. The median functional coherence for the random clusters is clearly much lower than that for clusters derived from PROSITE. b) We plotted functional coherence as a function of percent signal. We decreased functional signal by randomly replacing members of the six "PROSITE min" clusters with either structurally similar proteins (left), or random proteins (right). Functional coherence decreases exponentially as the proportion of biological signal decreases.
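The dilution experiment in panel b can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function name `dilute_cluster`, the protein identifiers, and the replacement pool are hypothetical placeholders, and the functional coherence score itself is not computed here because the source does not give its formula.

```python
import random


def dilute_cluster(members, replacement_pool, percent_signal, seed=0):
    """Return a cluster of the same size in which only `percent_signal`
    percent of the original members remain; the others are swapped for
    random draws from `replacement_pool` (original members excluded)."""
    if not 0 <= percent_signal <= 100:
        raise ValueError("percent_signal must be between 0 and 100")
    rng = random.Random(seed)
    n_keep = round(len(members) * percent_signal / 100)
    n_replace = len(members) - n_keep
    original = set(members)
    candidates = [p for p in replacement_pool if p not in original]
    if n_replace > len(candidates):
        raise ValueError("replacement pool is too small")
    kept = rng.sample(list(members), n_keep)
    return kept + rng.sample(candidates, n_replace)


# Hypothetical identifiers: a 10-member functional cluster and a pool of
# unrelated proteins to draw replacements from.
functional = [f"FUNC{i:02d}" for i in range(10)]
pool = [f"RAND{i:02d}" for i in range(50)]
diluted = dilute_cluster(functional, pool, percent_signal=60)
```

Scoring `diluted` at several values of `percent_signal` would give one curve of the kind plotted in the figure; using a pool of structurally similar proteins instead of random ones corresponds to the left-hand panel.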
Rediscovered functional sites
| cluster ID | Size | FC | Function |
|---|---|---|---|
| Clust1-Sub13 | 5 | 16.93 | Copper binding, multicopper oxidase type with C1H2 coordination |
| Clust1-Sub52 | 7 | 3.73 | Zinc binding, C2H2 and multi-HIS type |
| Clust1-Sub53 | 13 | 11.26 | Zinc binding, 1 CYS + multi-HIS + ASP/GLU + H2O coordination, with several sites being dinuclear. |
| Clust1-Sub118 | 10 | 3.11 | Zinc binding, C3H1 type |
| Clust1-Sub257 | 7 | 10.65 | Associated with TYR phosphatases and adjacent to active site |
| Clust10-Sub26 | 7 | 7.36 | Metal binding with four sulfur coordination - iron (2FE2S) with 2 CYS and zinc binding with 4 CYS |
| Clust21-Sub5 | 5 | 3.15 | Tyrosine phosphatase active site |
| Clust21-Sub17 | 5 | 4.45 | Iron binding, 2FE2S with additional CYS present |
| Clust21-Sub27 | 7 | 11.86 | Tyrosine phosphatase active site, enriched for polyfunctional proteins |
| Clust22-Sub159 | 5 | 3.11 | Iron binding, 4FE4S type with additional LYS and PRO nearby |
| Clust23-Sub44 | 10 | 15.20 | Cytochrome C heme binding, C2H2 type |
| Clust23-Sub46 | 17 | 18.53 | Cytochrome C heme binding, high molecular weight cytochromes |
| Clust23-Sub80 | 5 | 13.77 | Cytochrome C heme binding, C2H2 type |
| Clust23-Sub83 | 7 | 3.15 | Cytochrome C heme binding, additional CYS, MET, or LYS, 1 HIS, and at least 1 PRO present |
| Clust29-Sub110 | 6 | 4.65 | Zinc binding, C3H1 type |
| Clust30-Sub15 | 6 | 4.45 | Iron binding, 2FE2S oxidoreductase type with 3-4 CYS present |
| Clust30-Sub24 | 6 | 4.45 | Iron binding, 2FE2S oxidoreductase type with 4 CYS |
| Clust30-Sub57 | 5 | 14.86 | Iron binding, 2FE2S ferredoxin type with 3-4 CYS |
| Clust30-Sub110 | 6 | 12.81 | Iron binding, 2FE2S ferredoxin type with 4-5 CYS |
| Clust30-Sub122 | 5 | 15.77 | Iron binding, 2FE2S ferredoxin type with 4-5 CYS |
| Clust30-Sub160 | 10 | 24.48 | Iron binding, mixed 2FE2S and 4FE4S with 4-6 CYS or MET |
| Clust31-Sub14 | 9 | 5.03 | Ser/Thr protein kinase associated site corresponding to domain IX, adjacent to substrate recognition site |
| Clust32-Sub46 | 7 | 3.78 | Zinc binding, multinuclear site (3-4) with 7+ CYS |
| Clust32-Sub62 | 5 | 4.51 | Zinc binding with 4 CYS |
| Clust32-Sub208 | 6 | 3.03 | Zinc binding with 4 CYS |
| Clust32-Sub222 | 15 | 14.95 | Metal binding (zinc, iron) with 4 CYS |
| Clust32-Sub382 | 7 | 5.87 | Zinc binding with 4 CYS and additional ASP and GLU nearby |
| Clust33-Sub49 | 6 | 17.69 | Copper binding, blue copper C2H2 type |
| Clust33-Sub60 | 16 | 3.71 | Zinc binding, mixed C2H2 and C3H1 type |
| Clust33-Sub63 | 5 | 4.78 | Zinc binding, mixed C2H2 and C3H1 type |
| Clust33-Sub83 | 17 | 3.82 | Zinc binding, majority C2H2 type, C3H1 have additional HIS nearby |
| Clust33-Sub99 | 13 | 3.65 | Metal binding, iron has C1H4 type, zinc has C2H2 type with additional HIS nearby |
| Clust33-Sub109 | 8 | 6.13 | Zinc binding, C3H1 type |
| Clust33-Sub156 | 6 | 4.29 | Zinc binding, C2H2 type |
| Clust33-Sub237 | 10 | 6.26 | Zinc binding, C3H1 type |
| Clust33-Sub343 | 6 | 4.44 | Zinc binding, C3H1 type |
These clusters represent functional annotations that are already known in that all or the vast majority of sites are annotated for that function. FC = functional coherence.
Novel annotations for individual proteins
| PDB ID | Site residue | cluster ID | Annotation |
|---|---|---|---|
| 1GY8:A | CYS274 | Clust1-Sub53 | Zinc binding |
| 1UC2:A | CYS98 | Clust1-Sub53 | Zinc binding |
| 1NYQ:A | CYS181 | Clust1-Sub53 | Zinc binding |
| | CYS278 | Clust22-Sub159 | Iron binding |
These annotations represent predictions for proteins in clusters where the function can be readily identified. The functional coherence for Clust1-Sub53 is 11.26 and functional coherence for Clust22-Sub159 is 3.11.
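One way to picture how an annotation is proposed for an unannotated site is a simple majority-transfer rule within a cluster. The sketch below is an assumption for illustration only: the function name `transfer_annotation`, the site identifiers, and the 0.8 threshold are hypothetical, and the paper's actual annotation step draws on literature, Swiss-Prot records, and PDB HETATM data rather than a fixed cutoff.

```python
from collections import Counter


def transfer_annotation(site_labels, min_fraction=0.8):
    """`site_labels` maps a site ID to its known annotation, or None.
    If a single annotation covers at least `min_fraction` of the cluster,
    return {site ID: annotation} for the unannotated sites; otherwise
    return an empty dict."""
    known = Counter(label for label in site_labels.values() if label is not None)
    if not known:
        return {}
    label, count = known.most_common(1)[0]
    if count / len(site_labels) < min_fraction:
        return {}
    return {site: label
            for site, current in site_labels.items()
            if current is None}


# Hypothetical cluster: nine sites annotated as zinc binding, one unannotated.
cluster = {f"site_{i}": "zinc binding" for i in range(9)}
cluster["site_9"] = None
predictions = transfer_annotation(cluster)
```

A cluster with no dominant annotation yields no predictions, which mirrors the restriction of these predictions to clusters where the function can be readily identified.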
Potentially novel functional sites
| cluster ID | Size | FC | Putative annotation | Distinguishing features |
|---|---|---|---|---|
| Clust4-Sub23 | 5* | 8.79 | Structural role | Extended beta sheet environment with repeated CYS flanked by PHE. * Several sites are adjacent to one another, and may be involved in disulfide bonds. |
| Clust5-Sub70 | 12 | 3.07 | TYR phosphorylation site, possibly autocatalytic | 2/3 of proteins are TYR kinases with multiple phosphorylation sites. Environment characterized by loop containing CYS, MET, and TYR. |
| Clust6-Sub240 | 5 | 4.33 | Associated with ligand binding | 80% of sites are near a bound ligand. |
| Clust8-Sub25 | 11 | 4.12 | Structural role | Inward facing CYS on a surface accessible helix surrounded by an abundance of aliphatic, hydrophobic sidechains. |
| Clust8-Sub352 | 6 | 4.18 | Structural role | Helical CYS in the vicinity of 1 HIS and several aliphatic, hydrophobic sidechains. |
| Clust15-Sub152 | 6 | 4.43 | Associated with enzymatic activity | All proteins are enzymes. Environment contains multiple ARG and occasionally HIS. |
| Clust21-Sub48 | 9* | 3.06 | Associated with WD repeat motif | Environment characterized by beta sheets and the presence of another CYS. * Several sites are adjacent to one another. |
| Clust24-Sub17 | 5 | 4.34 | Functional role | Environment contains an ASP, a GLU, and usually at least one LYS, all charged and polar residues. |
| Clust25-Sub19 | 5 | 6.32 | Associated with sugar kinases | Beta sheet environment with multiple sulfur-containing residues. |
| Clust31-Sub18 | 5 | 7.00 | Protein binding | 80% of proteins are protein-binding. Environment characterized by helical CYS and an opposing TRP residue. |
| Clust36-Sub127 | 5 | 5.04 | Functional role | Environment is solvent exposed with an ASP and LYS forming a possible triad with the CYS. |
| Clust39-Sub58 | 5 | 18.34 | Associated with viral proteins | Sparse environment containing TRP, TYR, THR, and ARG, all polar and mostly hydrophobic residues. |
These predictions represent clusters where the function is not obvious but reasonable evidence exists for a coherent functional theme, even if it is in a structural role. FC = functional coherence.
Figure 3. Two distinct clusters for copper binding. (a) Clust33-Sub49 consists of copper-binding environments from blue copper proteins involved in electron transport. (b) Clust1-Sub13 consists of copper-binding environments from multicopper oxidase proteins, so named because they contain multiple copper centers. The mode of binding for both types of proteins is similar. All microenvironment images were generated using PyMol [59].
Figure 4. Different types of zinc binding sites. Our cluster selection approach divides several clusters into smaller groups of zinc binding site environments. Many of these represent different types of zinc binding sites: (from left to right) coordination by four CYS residues, coordination by three CYS and one HIS residue, coordination by two CYS and two HIS residues (C2H2 type), and coordination of multiple zinc ions by many diverse residues, including CYS, HIS, ASP, GLU, and water.
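The four site types named in the Figure 4 caption can be expressed as a small labeling rule over the coordinating residues. This is a hedged sketch, not part of the published method: the function name `zinc_site_type`, the `n_zinc` parameter, and the "other" fallback are assumptions, and the labels simply reuse the wording from the tables ("4 CYS", "C3H1", "C2H2").

```python
from collections import Counter

# (number of CYS, number of HIS) -> label for a mononuclear, four-ligand site
MONONUCLEAR_TYPES = {
    (4, 0): "4 CYS",
    (3, 1): "C3H1",
    (2, 2): "C2H2",
}


def zinc_site_type(ligands, n_zinc=1):
    """Label a zinc site from the three-letter names of its coordinating
    residues (water written as 'HOH'). Sites with more than one zinc ion
    are reported as 'multinuclear' regardless of their ligands."""
    if n_zinc > 1:
        return "multinuclear"
    counts = Counter(name.upper() for name in ligands)
    if len(ligands) != 4:
        return "other"
    return MONONUCLEAR_TYPES.get((counts["CYS"], counts["HIS"]), "other")


site_label = zinc_site_type(["CYS", "HIS", "CYS", "HIS"])
```

A rule of this kind only describes the clusters after the fact; in the paper the groups emerge from clustering the microenvironments, not from counting ligands.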
Figure 5. Potentially novel zinc binding sites in Clust1-Sub53. We predict zinc binding sites for (from left to right) structures 1GY8:A (no Swiss-Prot accession number) at CYS274, 1UC2:A [Swiss-Prot:O59245] at CYS98, and 1NYQ:A [Swiss-Prot:Q8NW68] at CYS181 based on zinc binding for other microenvironments in this cluster. Features supporting this prediction include the presence of multiple HIS residues and occasionally ASP or GLU, all known to coordinate zinc.
Figure 6. Clust8-Sub25 - Novel microenvironment motif with a potential structural role. Five representative microenvironments from a total of 11 are shown. This set of microenvironments is characterized by a central CYS located in a helix, with its sidechain surrounded by an abundance of aliphatic, hydrophobic sidechains (ILE, LEU, VAL). Cysteines are often important for stabilizing protein structures, and the absence of reactive sidechains, combined with the striking similarity between members of this cluster, suggests a potential structural role for this microenvironment.
Figure 7. Clust5-Sub70 - potential TYR autophosphorylation site. This cluster contains 12 microenvironments, eight of which belong to tyrosine kinases. In the eight kinase microenvironments, the CYS is on a loop next to a helix containing a TYR residue; the environment as a whole is surface-exposed and contains additional sulfur-containing residues. From left to right, we show 1K9A:A [Swiss-Prot:P32577], in which TYR416 is annotated as a putative autophosphorylation site (by similarity), 1LUF:A [Swiss-Prot:Q62838], in which TYR831 is not annotated as a potential phosphorylation site, and 1Z45:A [Swiss-Prot:P04397], a yeast aldose 1-epimerase, which is not a TYR kinase. There is, however, a surface-exposed TYR in a loop environment with an additional sulfur-containing residue.
Figure 8. Clust36-Sub127 - Novel microenvironment motif with a potential functional role. This microenvironment motif is surface exposed and contains an ASP (red) and a LYS (blue) around the central CYS (yellow) in a potentially functional triad in four out of five cases. As these are all residues known to participate in chemical reactions, it is possible there is an active role for this recurring microenvironment.
Figure 9. Example annotation output for Clust21-Sub27, TYR phosphatase active sites. The HTML output for the cluster annotation method is shown for a tyrosine phosphatase active site cluster. A summary page showing general cluster information and top significant terms for each annotation type contains links to more detailed information for each type of annotation, including lists of proteins mapped to each annotation term. Detailed literature output shows the proteins and PMIDs contributing to each annotation term and abstract text for each PMID.