Elisa Boari de Lima, Wagner Meira, Raquel Cardoso de Melo-Minardi.
Abstract
As ever more genomes are sequenced, the vast majority of proteins can only be annotated computationally, given that experimental investigation is extremely costly. This highlights the need for computational methods that determine protein functions quickly and reliably. We believe that dividing a protein family into subtypes which share specific functions uncommon to the whole family reduces the complexity of the function annotation problem. Hence, the purpose of this work is to detect isofunctional subfamilies inside a family of unknown function, while identifying differentiating residues. Similarity between protein pairs according to various properties is interpreted as evidence of functional similarity. Data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Our fully automated technique obtained better clusters than ASMC for two families and equivalent results for the other two, including one whose clusters were manually defined. Clusters produced by our framework corresponded closely to the known subfamilies and were more contrasting than those produced by ASMC. Additionally, for the families whose specificity determining positions are known, such residues were among those our technique considered most important for differentiating a given group. When run on the crotonase and enolase superfamilies from the Structure-Function Linkage Database (SFLD), the results agreed closely with this gold standard. The best results consistently involved multiple data types, confirming our hypothesis that similarities according to different knowledge domains may be used as evidence of functional similarity.
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; the use of domain knowledge to detect subfamilies with different specificities within a family, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
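The integrate-then-cluster idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a fixed weighted sum stands in for the expression that genetic programming would evolve, the two toy similarity matrices and all function names are hypothetical, and the spectral step only handles a two-way split, found by power iteration on the normalized affinity matrix.

```python
import math

def combine(matrices, weights):
    """Weighted sum of similarity matrices (stand-in for a GP-evolved expression)."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights))
             for j in range(n)] for i in range(n)]

def spectral_bipartition(affinity, iterations=200):
    """Two-way split from the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(affinity)
    degree = [sum(row) for row in affinity]  # assumes every degree is > 0
    # Adding the identity makes all eigenvalues non-negative, so power
    # iteration converges to the largest one that is left after deflation.
    shifted = [[affinity[i][j] / math.sqrt(degree[i] * degree[j])
                + (1.0 if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    total = math.sqrt(sum(degree))
    top = [math.sqrt(d) / total for d in degree]  # known leading eigenvector
    v = [float(i + 1) for i in range(n)]          # deterministic start vector
    for _ in range(iterations):
        overlap = sum(a * b for a, b in zip(v, top))
        v = [a - overlap * b for a, b in zip(v, top)]  # drop trivial component
        v = [sum(shifted[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return [0 if x < 0 else 1 for x in v]

# Hypothetical similarities for four proteins: 0 and 1 are alike, 2 and 3 are alike.
sequence_sim = [[1.0, 0.9, 0.2, 0.1], [0.9, 1.0, 0.1, 0.2],
                [0.2, 0.1, 1.0, 0.8], [0.1, 0.2, 0.8, 1.0]]
structure_sim = [[1.0, 0.8, 0.3, 0.2], [0.8, 1.0, 0.2, 0.3],
                 [0.3, 0.2, 1.0, 0.9], [0.2, 0.3, 0.9, 1.0]]

affinity = combine([sequence_sim, structure_sim], [0.5, 0.5])
labels = spectral_bipartition(affinity)
print(labels)
```

In practice a library routine for spectral clustering with k clusters would replace the hand-written bipartition; the pure-Python version is used here only to keep the sketch dependency-free.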
Year: 2016 PMID: 27348631 PMCID: PMC4922564 DOI: 10.1371/journal.pcbi.1005001
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Example of MSA for a given reference structure’s pocket.
Data sources and corresponding identifiers for the protein similarity matrices employed in this work.
| Data Source | Name | Description |
|---|---|---|
| | | UniProt ID for protein A |
| | | UniProt ID for protein B |
| | | Global sequence alignment score |
| | | Local sequence alignment score |
| | | Structural alignment size |
| | | Structural alignment identity percentage |
| | | Structural alignment TM-score |
| | | Structural signature array distances |
| | | Conserved gene neighborhood score |
| | | Gene fusion score |
| | | Co-occurrence score |
| | | Co-expression score |
| | | Difference in molecular weights |
| | | Difference in isoelectric points |
| | | Difference in aliphatic residue contents |
| | | Difference in aromatic residue contents |
| | | Difference in polar residue contents |
| | | Difference in charged residue contents |
| | | Difference in basic residue contents |
| | | Difference in acid residue contents |
| | | Amino acid composition array distance |
| | | Difference in instability indices |
| | | Difference in GRAVY indices |
| | | Number of common annotations |
| | | Number of common terms |
| | | Putative active site identity percentage |
| | | Putative active site BLOSUM62 score |
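Several rows of this table are pairwise differences of per-sequence properties. A minimal sketch of how such features could be derived from two sequences is shown below. The residue classes (aromatic as F, W, Y; charged as D, E, K, R), the use of Euclidean distance for the composition array, and the function names are assumptions made for illustration, since the excerpt does not give the exact definitions used in the paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Residue classes follow a common convention; assumed here, not taken from the paper.
AROMATIC = set("FWY")
CHARGED = set("DEKR")

def fraction(sequence, residues):
    """Fraction of positions in the sequence that belong to a residue class."""
    return sum(aa in residues for aa in sequence) / len(sequence)

def composition(sequence):
    """Relative frequency of each of the 20 standard amino acids."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]

def pair_features(seq_a, seq_b):
    """Pairwise features in the spirit of the 'Difference in ...' rows."""
    comp_a, comp_b = composition(seq_a), composition(seq_b)
    return {
        "aromatic_diff": abs(fraction(seq_a, AROMATIC) - fraction(seq_b, AROMATIC)),
        "charged_diff": abs(fraction(seq_a, CHARGED) - fraction(seq_b, CHARGED)),
        # Euclidean distance is an assumption; other metrics would also fit.
        "composition_distance": math.sqrt(
            sum((x - y) ** 2 for x, y in zip(comp_a, comp_b))),
    }

features = pair_features("FFAA", "AAAA")  # toy sequences
print(features)
```

Smaller differences and distances would then be read as higher similarity before the matrices are combined.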
Structures used as templates for modeling the family sequences.
| Family | Subfamily | Structure | CSA Residues |
|---|---|---|---|
| Nucleotidyl cyclases | Adenylate cyclases | 1AB8:A | R1029 |
| | Guanylate cyclases | 3ET6:A | - |
| Protein kinases | Ser/Thr kinases | 2CPK:E | D166, K168, E170, N171, T201 |
| | Tyr kinases | 1U46:A | D252, A254, R256, N257, V292 |
| Serine proteases | Chymotrypsins | 1AB9:(A, B, C, D) | H57, D102, G193, S195, G196 |
| | Elastases | 1EST:A | H57, D102, G193, S195, G196 |
| | Trypsins | 5PTP:A | H57, D102, G193, S195, G196, S214 |
| DUF849 | - | 2Y7F:A, 3FA5:A, 3CHV:A, 3E49:A, 3E02:A, 3LOT:A, 3C6C:A | - |
Structures are presented in format PDB code:chain (e.g., 1AB8:A indicates chain A of PDB structure 1AB8). Residues are presented in format residuePosition (e.g., R1029 represents an Arg residue in position 1029 of the corresponding structure).
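The two notations described in this note are simple to parse. The sketch below is illustrative only: the function names are hypothetical, and it assumes one-letter residue codes from the 20 standard amino acids and the parenthesized multi-chain form that appears in the table (for example `1AB9:(A, B, C, D)`).

```python
import re

THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}

def parse_structure(text):
    """Split 'PDB code:chain' into the code and a list of chain identifiers."""
    code, chains = text.strip().split(":", 1)
    return code, [chain.strip() for chain in chains.strip("()").split(",")]

def parse_residue(text):
    """Split 'residuePosition' (e.g. R1029) into residue name and position."""
    match = re.fullmatch(r"([A-Z])(\d+)", text.strip())
    if match is None or match.group(1) not in THREE_LETTER:
        raise ValueError(f"not a residuePosition token: {text!r}")
    return THREE_LETTER[match.group(1)], int(match.group(2))

print(parse_structure("1AB8:A"))
print(parse_residue("R1029"))
```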
Structures used as templates for modeling the SFLD superfamily sequences.
| Superfamily | Subgroup | Structure | CSA Residues |
|---|---|---|---|
| crotonase | crotonase-like | 1MJ3:A | A98, S118, H122, G141, E164, G172 |
| enolase | enolase | 7ENL:A | E168, E211, K345, K396 |
| | glucarate dehydratase | 1ECQ:A | K205, K207, N237, H339 |
| | mandelate racemase | 1MDR:A | K166, D270, H297, E317 |
| | mannonate dehydratase | 3QKE:A | - |
| | methylaspartate ammonia-lyase | 1KKR:A | - |
| | muconate cycloisomerase | 3DG6:A | - |
Structures are presented in format PDB code:chain (e.g., 1AB8:A indicates chain A of PDB structure 1AB8). Residues are presented in format residuePosition (e.g., R1029 represents an Arg residue in position 1029 of the corresponding structure).
Comparison of Mutual Information (MI) values for the clusterings obtained by each technique for the studied protein families.
| Family | Clusters | GP System | ASMC |
|---|---|---|---|
| Nucleotidyl cyclases | 3 | 22.35 | 22.16 |
| | 6 | 16.13 | 14.11 |
| Protein kinases | 3 | 102.94 | 67.46 |
| | 7 | 50.70 | 45.99 |
| Serine proteases | 4 | 17.71 | 16.58 |
| | 11 | 12.09 | 10.59 |
| DUF849 | 7 | 36.51 | |
* This value refers to the seven clusters defined in [1] by manipulating ASMC’s output.
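The excerpt does not define how these MI values are computed. One plausible reading, given that residues are later ranked by partial MI value, is that MI is measured between the residue at each aligned active-site position and the cluster label, then summed over positions. The sketch below follows that assumption; the toy alignment, the use of bits (log base 2), and the function names are hypothetical, and the code does not reproduce the values in the table.

```python
import math
from collections import Counter

def column_mi(column, labels):
    """Mutual information (in bits) between one alignment column and cluster labels."""
    n = len(column)
    joint = Counter(zip(column, labels))
    residue_counts = Counter(column)
    label_counts = Counter(labels)
    return sum(
        (count / n) * math.log2(count * n / (residue_counts[res] * label_counts[lab]))
        for (res, lab), count in joint.items()
    )

def rank_positions(alignment, labels):
    """Return (partial MI, position) pairs, highest MI first."""
    width = len(alignment[0])
    scores = [(column_mi([seq[i] for seq in alignment], labels), i)
              for i in range(width)]
    return sorted(scores, key=lambda item: (-item[0], item[1]))

# Toy active-site alignment: positions 0 and 1 separate the clusters, position 2 does not.
alignment = ["DKG", "DKG", "ERG", "ERG"]
labels = [0, 0, 1, 1]

ranked = rank_positions(alignment, labels)
total_mi = sum(score for score, _ in ranked)
print(ranked, total_mi)
```

Under this reading, a higher total indicates clusters whose active sites differ more sharply, and the top-ranked positions are candidates for specificity determining positions.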
Data combinations which yielded the best results for the nucleotidyl cyclases in five runs of the GP system.
| Clusters | Run | Equation |
|---|---|---|
| 2 | 1 | 4 |
| 2 | ||
| 3 | ||
| 4 | 4 | |
| 3 | 1 | |
| 6 | 3 | 5 |
Fig 2. Nucleotidyl cyclase division into two clusters by the GP system.
Subfigure (a) shows the active site logo for the adenylate cyclase cluster, while (b) shows that for the guanylate cyclase cluster.
Most important residues for the two nucleotidyl cyclase clusters produced by the GP system.
| Cluster | Residues |
|---|---|
Listed in decreasing order of partial MI value. Residues in bold correspond to known SDPs. Subscripted positions correspond to those in PDB structure 3ET6:A.
Fig 3. DUF849 division into seven clusters produced by manually altering ASMC’s hierarchical clustering in [1].
Subfigures (a) through (g) show the active site logos for clusters G1 through G7, respectively.
Substrate nature in each group.
| Group | Substrates |
|---|---|
| | hydrophobic and non-charged polar |
| | KAH |
| | β-ketoadipate |
| | benzoylacetate and β-ketohexanoate |
| | hydrophobic and polar |
| | mixed BKACE |
| | not BKACE |
| | negatively charged |
| | positively charged |
| | not BKACE, presenting decarboxylation activity |
| | not BKACE |
Fig 4. DUF849 division into seven clusters by the GP system.
Subfigures (a) through (g) show the active site logos for clusters I through VII, respectively.
Data combinations which yielded the best results for the protein kinases in five runs of the GP system.
| Clusters | Run | Equation |
|---|---|---|
| 2 | 1 | |
| 3 | 2, 4 | |
| 7 | 3 |
Fig 5. Protein kinase division into two clusters by the GP system.
Subfigure (a) shows the active site logo for the cluster consisting mainly of Ser/Thr kinases, while (b) shows the logo for the cluster of Tyr kinases combined with the EGFR subcluster.
Most important residues for the two protein kinase clusters produced by the GP system.
| Cluster | Residues |
|---|---|
Listed in decreasing order of partial MI value. Residues in bold correspond to known SDPs. Subscripted positions correspond to those in PDB structure 1U46:A.
Data combinations which yielded the best results for the serine proteases in five runs of the GP system.
| Clusters | Run | Equation |
|---|---|---|
| 4 | 1 | |
| 11 | 1 | 2 |
Distribution of the crotonases among families.
| Family | Amount |
|---|---|
| enoyl-CoA hydratase | 1,507 |
| methylglutaconyl-CoA hydratase | 269 |
| 1,4-dihydroxy-2-napthoyl-CoA synthase | 217 |
| delta(3,5)-delta(2,4)-dienoyl-CoA isomerase | 201 |
| 1,2-epoxyphenylacetyl-CoA isomerase | 143 |
| dodecenoyl-CoA delta-isomerase (mitochondrial) | 87 |
| dodecenoyl-CoA delta-isomerase (peroxisomal) | 65 |
| diffusible signal factor (DSF) synthase | 55 |
| crotonobetainyl-CoA hydratase | 47 |
| polyketide biosynthesis enoyl-CoA hydratase | 40 |
| feruloyl-CoA hydratase/lyase | 33 |
| methylmalonyl-CoA decarboxylase | 30 |
Distribution of families among the twelve crotonase superfamily clusters produced by the GP system.
| Cluster | Size | Family | Amount |
|---|---|---|---|
| | 29 | methylmalonyl-CoA decarboxylase | 29/30 |
| | 55 | diffusible signal factor (DSF) synthase | 55/55 |
| | 58 | dodecenoyl-CoA delta-isomerase (peroxisomal) | 58/65 |
| | 68 | polyketide biosynthesis enoyl-CoA hydratase | 35/40 |
| | | feruloyl-CoA hydratase/lyase | 33/33 |
| | 84 | dodecenoyl-CoA delta-isomerase (mitochondrial) | 84/87 |
| | 178 | 1,2-epoxyphenylacetyl-CoA isomerase | 143/143 |
| | | enoyl-CoA hydratase | 34/1,507 |
| | | dodecenoyl-CoA delta-isomerase (peroxisomal) | 1/65 |
| | 201 | delta(3,5)-delta(2,4)-dienoyl-CoA isomerase | 201/201 |
| | 217 | 1,4-dihydroxy-2-napthoyl-CoA synthase | 217/217 |
| | 253 | methylglutaconyl-CoA hydratase 2 | 252/269 |
| | | polyketide biosynthesis enoyl-CoA hydratase | 1/40 |
| | 286 | enoyl-CoA hydratase | 227/1,507 |
| | | crotonobetainyl-CoA hydratase | 47/47 |
| | | polyketide biosynthesis enoyl-CoA hydratase | 4/40 |
| | | dodecenoyl-CoA delta-isomerase (peroxisomal) | 3/65 |
| | | methylglutaconyl-CoA hydratase 2 | 3/269 |
| | | dodecenoyl-CoA delta-isomerase (mitochondrial) | 1/87 |
| | | methylmalonyl-CoA decarboxylase | 1/30 |
| | 404 | enoyl-CoA hydratase | 404/1,507 |
| | 861 | enoyl-CoA hydratase | 842/1,507 |
| | | methylglutaconyl-CoA hydratase 2 | 14/269 |
| | | dodecenoyl-CoA delta-isomerase (peroxisomal) | 3/65 |
| | | dodecenoyl-CoA delta-isomerase (mitochondrial) | 2/87 |
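The "Amount" column reports, for each family in a cluster, the number of members in that cluster over the family's total size. A minimal sketch of how such a tally could be produced from (cluster, family) assignments is given below. The cluster names `c68`, `c253`, and `c286` are hypothetical labels for the clusters of those sizes, and only the two families needed to reproduce the 68-member cluster are included, so the other two clusters are deliberately incomplete.

```python
from collections import Counter, defaultdict

def cluster_composition(assignments):
    """Summarize each cluster as its size plus 'count/family total' per family."""
    family_totals = Counter(family for _, family in assignments)
    per_cluster = defaultdict(Counter)
    for cluster, family in assignments:
        per_cluster[cluster][family] += 1
    return {
        cluster: {
            "size": sum(counts.values()),
            "families": {family: f"{count}/{family_totals[family]}"
                         for family, count in counts.most_common()},
        }
        for cluster, counts in per_cluster.items()
    }

POLYKETIDE = "polyketide biosynthesis enoyl-CoA hydratase"
FERULOYL = "feruloyl-CoA hydratase/lyase"

# Counts taken from the table: 35 + 1 + 4 polyketide members, 33 feruloyl members.
assignments = (
    [("c68", POLYKETIDE)] * 35 + [("c68", FERULOYL)] * 33
    + [("c253", POLYKETIDE)] * 1 + [("c286", POLYKETIDE)] * 4
)

summary = cluster_composition(assignments)
print(summary["c68"])
```

A cluster dominated by one family with a count close to the family total, such as 33/33 here, indicates good agreement with the reference classification.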
Distribution of the enolases among subgroups and families.
| Subgroup | Family | Amount |
|---|---|---|
| enolase | enolase | 2,492 |
| mandelate racemase | D-galactonate dehydratase | 474 |
| | rhamnonate dehydratase | 224 |
| | L-fuconate dehydratase | 183 |
| | D-tartrate dehydratase | 98 |
| | L-talarate/galactarate dehydratase | 98 |
| muconate cycloisomerase | dipeptide epimerase | 448 |
| | o-succinylbenzoate synthase | 370 |
| | N-succinylamino acid racemase 2 | 70 |
| glucarate dehydratase | glucarate dehydratase | 193 |
| mannonate dehydratase | mannonate dehydratase | 84 |
| methylaspartate ammonia-lyase | methylaspartate ammonia-lyase | 57 |
Distribution of families among the twelve enolase superfamily clusters produced by the GP system.
| Cluster | Size | Family | Amount |
|---|---|---|---|
| | 80 | mannonate dehydratase | 80/84 |
| | 87 | rhamnonate dehydratase | 87/224 |
| | 92 | D-tartrate dehydratase | 92/98 |
| | 94 | L-talarate/galactarate dehydratase | 94/98 |
| | 123 | rhamnonate dehydratase | 123/224 |
| | 140 | o-succinylbenzoate synthase | 140/370 |
| | 165 | L-fuconate dehydratase | 165/183 |
| | 177 | glucarate dehydratase | 177/193 |
| | 443 | dipeptide epimerase | 363/448 |
| | | N-succinylamino acid racemase 2 | 61/70 |
| | | o-succinylbenzoate synthase | 19/370 |
| | 456 | D-galactonate dehydratase | 456/474 |
| | 942 | enolase | 513/2,492 |
| | | o-succinylbenzoate synthase | 204/370 |
| | | dipeptide epimerase | 82/448 |
| | | methylaspartate ammonia-lyase | 57/57 |
| | | L-fuconate dehydratase | 18/183 |
| | | D-galactonate dehydratase | 17/474 |
| | | glucarate dehydratase | 16/193 |
| | | rhamnonate dehydratase | 12/224 |
| | | N-succinylamino acid racemase 2 | 9/70 |
| | | D-tartrate dehydratase | 6/98 |
| | | L-talarate/galactarate dehydratase | 4/98 |
| | | mannonate dehydratase | 4/84 |
| | 1,992 | enolase | 1,979/2,492 |
| | | o-succinylbenzoate synthase | 7/370 |
| | | dipeptide epimerase | 3/448 |
| | | rhamnonate dehydratase | 2/224 |
| | | D-galactonate dehydratase | 1/474 |