Jose C A Santos, Houssam Nassif, David Page, Stephen H Muggleton, Michael J E Sternberg.
Abstract
BACKGROUND: There is a need for automated methods to learn general features of the interactions of a ligand class with its diverse set of protein receptors. An appropriate machine learning approach is Inductive Logic Programming (ILP), which automatically generates comprehensible rules in addition to predictions. The development of ILP systems which can learn rules of the complexity required for studies on protein structure remains a challenge. In this work we use a new ILP system, ProGolem, and demonstrate its performance on learning features of hexose-protein interactions.
Year: 2012 PMID: 22783946 PMCID: PMC3458898 DOI: 10.1186/1471-2105-13-162
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The positive dataset, composed of 80 non-redundant protein-hexose binding sites
| Hexose | PDB ID | Ligand | PDB ID | Ligand |
|---|---|---|---|---|
| Glucose | 1BDG | GLC-501 | 1ISY | GLC-1471 |
| | 1EX1 | GLC-617 | 1J0Y | GLC-1601 |
| | 1GJW | GLC-701 | 1JG9 | GLC-2000 |
| | 1GWW | GLC-1371 | 1K1W | GLC-653 |
| | 1H5U | GLC-998 | 1KME | GLC-501 |
| | 1HIZ | GLC-1381 | 1MMU | GLC-1 |
| | 1HIZ | GLC-1382 | 1NF5 | GLC-125 |
| | 1HKC | GLC-915 | 1NSZ | GLC-1400 |
| | 1HSJ | GLC-671 | 1PWB | GLC-405 |
| | 1HSJ | GLC-672 | 1Q33 | GLC-400 |
| | 1I8A | GLC-189 | 1RYD | GLC-601 |
| | 1ISY | GLC-1461 | 1S5M | AGC-1001 |
| | 1SZ2 | BGC-1001 | 1SZ2 | BGC-2001 |
| | 1U2S | GLC-1 | 1UA4 | GLC-1457 |
| | 1V2B | AGC-1203 | 1WOQ | GLC-290 |
| | 1Z8D | GLC-901 | 2BQP | GLC-337 |
| | 2BVW | GLC-602 | 2BVW | GLC-603 |
| | 2F2E | AGC-401 | | |
| Galactose | 1AXZ | GLA-401 | 1MUQ | GAL-301 |
| | 1DIW | GAL-1400 | 1NS0 | GAL-1400 |
| | 1DJR | GAL-1104 | 1NS2 | GAL-1400 |
| | 1DZQ | GAL-502 | 1NS8 | GAL-1400 |
| | 1EUU | GAL-2 | 1NSM | GAL-1400 |
| | 1ISZ | GAL-461 | 1NSU | GAL-1400 |
| | 1ISZ | GAL-471 | 1NSX | GAL-1400 |
| | 1JZ7 | GAL-2001 | 1OKO | GLB-901 |
| | 1KWK | GAL-701 | 1OQL | GAL-265 |
| | 1L7K | GAL-500 | 1OQL | GAL-267 |
| | 1LTI | GAL-104 | 1PIE | GAL-1 |
| | 1R47 | GAL-1101 | 1S5D | GAL-704 |
| | 1S5E | GAL-751 | 1S5F | GAL-104 |
| | 1SO0 | GAL-500 | 1TLG | GAL-1 |
| | 1UAS | GAL-1501 | 1UGW | GAL-200 |
| | 1XC6 | GAL-9011 | 1ZHJ | GAL-1 |
| | 2GAL | GAL-998 | | |
| Mannose | 1BQP | MAN-402 | 1KZB | MAN-1501 |
| | 1KLF | MAN-1500 | 1KZC | MAN-1001 |
| | 1KX1 | MAN-20 | 1KZE | MAN-1001 |
| | 1KZA | MAN-1001 | 1OP3 | MAN-503 |
| | 1OUR | MAN-301 | 1QMO | MAN-302 |
| | 1U4J | MAN-1008 | 1U4J | MAN-1009 |
The table lists the protein’s PDB ID and the hexose ligand considered.
Protein binding-sites that bind non-hexose ligands
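| PDB ID | Cavity center | Ligand | PDB ID | Cavity center | Ligand |
|---|---|---|---|---|---|
| Hexose-like ligands | | | | | |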
| 1A8U | 4320, 4323 | BEZ-1 | 1AI7 | 6074, 6077 | IPH-1 |
| 1AWB | 4175, 4178 | IPD-2 | 1DBN | pyranose ring | GAL-102 |
| 1EOB | 3532, 3536 | DHB-999 | 1F9G | 5792, 5785, 5786 | ASC-950 |
| 1G0H | 4045, 4048 | IPD-292 | 1JU4 | 4356, 4359 | BEZ-1 |
| 1LBX | 3941, 3944 | IPD-295 | 1LBY | 3944, 3939, 3941 | F6P-295 |
| 1LIU | 15441, 15436, 15438 | FBP-580 | 1MOR | pyranose ring | G6P-609 |
| 1NCW | 3406, 3409 | BEZ-601 | 1P5D | pyranose ring | G1P-658 |
| 1T10 | 4366, 4361, 4363 | F6P-1001 | 1U0F | pyranose ring | G6P-900 |
| 1UKB | 2144, 2147 | BEZ-1300 | 1X9I | pyranose ring | G6Q-600 |
| 1Y9G | 4124, 4116, 4117 | FRU-801 | 2B0C | pyranose ring | G1P-496 |
| 2B32 | 3941, 3944 | IPH-401 | 4PBG | pyranose ring | BGP-469 |
| Other ligands | | | | | |
| 11AS | 5132 | ASN-1 | 11GS | 1672, 1675 | MES-3 |
| 1A0J | 6985 | BEN-246 | 1A42 | 2054, 2055 | BZO-555 |
| 1A50 | 4939, 4940 | FIP-270 | 1A53 | 2016, 2017 | IGP-300 |
| 1AA1 | 4472, 4474 | 3PG-477 | 1AJN | 6074, 6079 | AAN-1 |
| 1AJS | 3276, 3281 | PLA-415 | 1AL8 | 2652 | FMN-360 |
| 1B8A | 7224 | ATP-500 | 1BO5 | 7811 | GOL-601 |
| 1BOB | 2566 | ACO-400 | 1D09 | 7246 | PAL-1311 |
| 1EQY | 3831 | ATP-380 | 1IOL | 2674, 2675 | EST-400 |
| 1JTV | 2136, 2137 | TES-500 | 1KF6 | 16674, 16675 | OAA-702 |
| 1RTK | 3787, 3784 | GBS-300 | 1TJ4 | 1947 | SUC-1 |
| 1TVO | 2857 | FRZ-1001 | 1UK6 | 2142 | PPI-1300 |
| 1W8N | 4573, 4585 | DAN-1649 | 1ZYU | 1284, 1286 | SKM-401 |
| 2D7S | 3787 | GLU-1008 | 2GAM | 11955 | NGA-502 |
| 3PCB | 3421, 3424 | 3HB-550 | | | |
The table lists the protein’s PDB ID, the ligand considered and the specified cavity center. Twenty-two of the ligands are similar to hexoses in shape and/or size. The cavity center is the centroid of the atoms with the reported PDB atom numbers.
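The centroid definition used for the cavity center can be illustrated with a minimal Python sketch. The function name `cavity_center`, the dictionary layout, and the coordinates below are assumptions made for illustration only; the atom numbers are borrowed from the 1A8U row, but the coordinates are invented and do not come from the PDB entry.

```python
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]


def cavity_center(atom_coords: Dict[int, Point], atom_numbers: Iterable[int]) -> Point:
    """Return the centroid (mean x, y, z) of the atoms with the given serial numbers."""
    points = [atom_coords[n] for n in atom_numbers]
    if not points:
        raise ValueError("at least one atom number is required")
    count = len(points)
    return tuple(sum(p[i] for p in points) / count for i in range(3))


# Hypothetical coordinates keyed by PDB atom serial number (NOT real 1A8U data).
toy_atoms = {4320: (10.0, 2.0, 4.0), 4323: (12.0, 4.0, 8.0)}

center = cavity_center(toy_atoms, [4320, 4323])
print(center)
```

For entries that list a single atom number, the centroid under this definition is simply that atom's position.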
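Non-binding sites negative dataset, composed of random surface pockets that do not bind any ligand
| PDB ID | Cavity center | PDB ID | Cavity center | PDB ID | Cavity center |
|---|---|---|---|---|---|
| 1A04 | 1424, 2671 | 1A0I | 1689, 799 | 1A22 | 2927 |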
| 1AA7 | 579 | 1AF7 | 631, 1492 | 1AM2 | 1277 |
| 1ARO | 154, 1663 | 1ATG | 1751 | 1C3G | 630, 888 |
| 1C3P | 1089, 1576 | 1DXJ | 867, 1498 | 1EVT | 2149, 2229 |
| 1FI2 | 1493 | 1KLM | 4373, 4113 | 1KWP | 1212 |
| 1QZ7 | 3592, 2509 | 1YQZ | 4458, 4269 | 1YVB | 1546, 1814 |
| 1ZT9 | 1056, 1188 | 2A1K | 2758, 3345 | 2AUP | 2246 |
| 2BG9 | 14076, 8076 | 2C9Q | 777 | 2CL3 | 123, 948 |
| 2DN2 | 749, 1006 | 2F1K | 316, 642 | 2G50 | 26265, 31672 |
| 2G69 | 248, 378 | 2GRK | 369, 380 | 2GSE | 337, 10618 |
| 2GSH | 6260 | | | | |
The table lists the protein’s PDB ID and the specified cavity center, computed as the centroid of the reported PDB atom numbers.
Excerpt of the background knowledge for protein 1BDG in Prolog
| center_coords(p1BDG, p(27.0,22.1,64.9)). |
|---|
| has_aminoacid(p1BDG, a64, phe). |
| has_aminoacid(p1BDG, a85, leu). |
| has_aminoacid(p1BDG, a86, gly). |
| has_aminoacid(p1BDG, a87, gly). |
| has_atom(p1BDG, a64, 'CD2', p(22.4,13.3,65.5)). |
| has_atom(p1BDG, a64, 'CE2', p(21.6,14.0,66.4)). |
| has_atom(p1BDG, a85, 'C', p(24.6,25.9,57.4)). |
| has_atom(p1BDG, a85, 'O', p(24.6,24.8,57.8)). |
| has_atom(p1BDG, a86, 'N', p(24.8,27.0,58.3)). |
| has_atom(p1BDG, a86, 'CA', p(24.9,26.8,59.7)). |
Because 1BDG is a hexose-binding protein, the center_coords/2 predicate gives the coordinates of the hexose binding center. The has_aminoacid and has_atom predicates list the amino acids, and the coordinates of their atoms, that lie within 10 Å of the binding-site center.
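As a rough illustration of the 10 Å neighborhood, the following Python sketch re-encodes the 1BDG excerpt above and keeps only atoms within a given radius of the center. The tuple layout and the helper name `within_radius` are assumptions for this sketch, not part of the original background knowledge.

```python
import math

# Center and atoms copied from the 1BDG excerpt; the tuple layout is an assumption.
center = (27.0, 22.1, 64.9)
atoms = [
    ("a64", "CD2", (22.4, 13.3, 65.5)),
    ("a64", "CE2", (21.6, 14.0, 66.4)),
    ("a85", "C", (24.6, 25.9, 57.4)),
    ("a85", "O", (24.6, 24.8, 57.8)),
    ("a86", "N", (24.8, 27.0, 58.3)),
    ("a86", "CA", (24.9, 26.8, 59.7)),
]


def within_radius(center, atoms, radius=10.0):
    """Keep atoms whose Euclidean distance to the center is at most `radius` (in Å)."""
    kept = []
    for residue, name, xyz in atoms:
        distance = math.dist(center, xyz)
        if distance <= radius:
            kept.append((residue, name, distance))
    return kept


neighbourhood = within_radius(center, atoms)
for residue, name, distance in neighbourhood:
    print(f"{residue} {name}: {distance:.2f} Å")
```

All six atoms in the excerpt pass the 10 Å filter, which is consistent with the excerpt having been drawn from the binding-site neighborhood.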
Background knowledge predicates for the two binding site representations
| Representation | Background knowledge predicates |
|---|---|
| atom-only | center_coords/2, has_atom/4, dist/4 |
| amino acid | has_aminoacid/3, atom_to_center_dist/4, atom_to_atom_dist/6, diff_aminoacid/2 |
The /N suffix gives the arity of each background knowledge predicate. For instance, given a binding site, the center_coords predicate returns the coordinates of its center (1 input argument + 1 output argument = 2).
An amino acid representation hypothesis
| bind(A):- | |
|---|---|
| | has_aminoacid(A,B,asp), |
| | atom_to_atom_dist(B,B,'N','OD2',4.6,0.5), |
| | has_aminoacid(A,C,leu), |
| | has_aminoacid(A,D,cys), |
| | atom_to_center_dist(B,'C',7.6,0.5). |
English translation: A protein is hexose-binding if the N and OD2 atoms of an aspartic acid are 4.6 ± 0.5 Å apart, the C atom of that aspartic acid is 7.6 ± 0.5 Å from the binding center, and a leucine and a cysteine are also present.
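To make the reading of this hypothesis concrete, here is a small Python sketch that evaluates the same conditions on one binding site. It follows the English translation above; the dictionary layout, the function names `within` and `binds`, and the toy coordinates are all assumptions for illustration and are not ProGolem's own evaluation procedure.

```python
import math


def within(value, target, tolerance):
    """True when value lies in [target - tolerance, target + tolerance]."""
    return abs(value - target) <= tolerance


def binds(site):
    """Evaluate the example hypothesis on one binding site.

    `site` is a hypothetical dict:
    {"center": xyz, "residues": {residue_id: (type, {atom_name: xyz})}}
    """
    residues = site["residues"].values()
    types = {res_type for res_type, _ in residues}
    # has_aminoacid(A,C,leu), has_aminoacid(A,D,cys)
    if not {"leu", "cys"} <= types:
        return False
    for res_type, atoms in residues:
        # has_aminoacid(A,B,asp)
        if res_type != "asp" or not {"N", "OD2", "C"} <= atoms.keys():
            continue
        # atom_to_atom_dist(B,B,'N','OD2',4.6,0.5)
        n_to_od2 = math.dist(atoms["N"], atoms["OD2"])
        # atom_to_center_dist(B,'C',7.6,0.5)
        c_to_center = math.dist(atoms["C"], site["center"])
        if within(n_to_od2, 4.6, 0.5) and within(c_to_center, 7.6, 0.5):
            return True
    return False


# Toy site with invented coordinates, chosen only to satisfy the rule.
toy_site = {
    "center": (0.0, 0.0, 0.0),
    "residues": {
        "a1": ("asp", {"N": (0.0, 10.0, 0.0), "OD2": (4.6, 10.0, 0.0), "C": (7.6, 0.0, 0.0)}),
        "a2": ("leu", {}),
        "a3": ("cys", {}),
    },
}

print(binds(toy_site))
```

Both distance conditions are checked on the same aspartic acid, mirroring the shared variable B in the rule body.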
Atom-only representation 10-fold cross-validation predictive accuracies for ProGolem using different recall selection methods
| 1 | 43.8 | 56.3 | 87.5 |
| 2 | 62.5 | 93.8 | 78.5 |
| 3 | 81.3 | 87.5 | 87.5 |
| 4 | 56.3 | 50.0 | 43.8 |
| 5 | 68.8 | 68.8 | 81.3 |
| 6 | 37.5 | 56.3 | 81.3 |
| 7 | 56.3 | 62.5 | 75.0 |
| 8 | 68.8 | 68.8 | 81.3 |
| 9 | 62.5 | 81.3 | 62.5 |
| 10 | 56.3 | 62.5 | 68.8 |
| Mean | 59.4 | 68.8 | 74.8 |
| Std Dev | 12.6 | 14.4 | 13.4 |
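The Mean and Std Dev rows can be checked from the per-fold values with a short Python sketch. The column headers did not survive in this table, so the keys `method_1` to `method_3` are placeholder names, and the use of the sample standard deviation (n - 1 denominator) is an assumption that appears to match the reported figures.

```python
from statistics import mean, stdev

# Per-fold accuracies copied from the table, one list per column.
# The keys are placeholders: the original column headers are not available here.
fold_accuracies = {
    "method_1": [43.8, 62.5, 81.3, 56.3, 68.8, 37.5, 56.3, 68.8, 62.5, 56.3],
    "method_2": [56.3, 93.8, 87.5, 50.0, 68.8, 56.3, 62.5, 68.8, 81.3, 62.5],
    "method_3": [87.5, 78.5, 87.5, 43.8, 81.3, 81.3, 75.0, 81.3, 62.5, 68.8],
}

# mean() averages the ten folds; stdev() is the sample standard deviation.
summary = {
    name: (mean(values), stdev(values))
    for name, values in fold_accuracies.items()
}

for name, (avg, sd) in summary.items():
    print(f"{name}: mean={avg:.2f}, std dev={sd:.2f}")
```

Under these assumptions the computed values should agree with the tabulated Mean and Std Dev rows up to rounding to one decimal place.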
10-fold cross-validation predictive accuracies for Aleph, ProGolem and SVM
| 1 | 50.0 | 75.0 | 56.3 | 75.0 | 81.3 |
| 2 | 68.8 | 81.3 | 68.8 | 81.3 | 87.5 |
| 3 | 62.5 | 68.8 | 68.8 | 93.8 | 87.5 |
| 4 | 50.0 | 56.3 | 68.8 | 75.0 | 75.0 |
| 5 | 75.0 | 81.3 | 56.3 | 81.3 | 75.0 |
| 6 | 68.8 | 87.5 | 81.3 | 87.5 | 87.5 |
| 7 | 75.0 | 81.3 | 75.0 | 81.3 | 93.8 |
| 8 | 93.8 | 81.3 | 75.0 | 93.8 | 87.5 |
| 9 | 68.8 | 75.0 | 75.0 | 81.3 | 75.0 |
| 10 | 56.3 | 56.3 | 87.5 | 81.3 | 62.5 |
| Mean | 66.9 | 74.4 | 71.3 | 83.2 | 81.3 |
| Std Dev | 13.2 | 10.8 | 9.8 | 6.6 | 9.3 |
The 1 beside Aleph and ProGolem stands for the atom-only representation and the 2 for the amino acid representation. SVM uses a different representation (see text).
Positive and negative examples covered by each reported ProGolem rule
| Rule | Positive examples covered | Negative examples covered |
|---|---|---|
| 1 | 37: 1BDG, 1BQP, 1DZQ, 1HKC, 1HSJ_2, 1ISY, 1ISZ, 1ISZ_2, 1J0Y, 1JG9, 1JZ7, 1KLF, 1MMU, 1MUQ, 1NSU, 1NSX, 1NSZ, 1OKO, 1OP3, 1OQL, 1OQL_2, 1OUR, 1PIE, 1Q33, 1S5M, 1SZ2, 1SZ2_2, 1TLG, 1U2S, 1U4J, 1U4J_2, 1UA4, 1UAS, 1WOQ, 2BQP, 2BVW, 2BVW_2 | 4: 1AWB, 1W8N, 2B0C, 2B32 |
| 2 | 24: 1DJR, 1EUU, 1HIZ, 1HSJ, 1HSJ_2, 1KWK, 1KX1, 1KZA, 1KZB, 1KZC, 1KZE, 1L7K, 1MUQ, 1NS8, 1NSM, 1NSU, 1NSX, 1NSZ, 1PWB, 1S5D, 1S5E, 1SO0, 1TLG, 1XC6 | 0 |
| 3 | 30: 1DIW, 1DJR, 1EUU, 1HIZ, 1ISZ, 1KX1, 1KZA, 1KZB, 1KZC, 1KZE, 1L7K, 1LTI, 1NS0, 1NS2, 1NS8, 1NSM, 1NSU, 1NSX, 1NSZ, 1OKO, 1OQL_2, 1OUR, 1PWB, 1QMO, 1S5D, 1S5E, 1S5F, 1SO0, 1U2S, 2GAL | 0 |
| 4 | 7: 1HSJ, 1HSJ_2, 1KME, 1RYD, 1S5M, 1TLG, 1UGW | 0 |
| 5 | 6: 1HIZ_2, 1KWK, 1QMO, 1U4J, 1U4J_2, 1XC6 | 0 |
| 6 | 18: 1ISY, 1ISY_2, 1ISZ, 1ISZ_2, 1NF5, 1OKO, 1OQL, 1PIE, 1R47, 1SO0, 1SZ2, 1SZ2_2, 1TLG, 1U4J, 1U4J_2, 1UAS, 2BVW, 2BVW_2 | 0 |
Figure 1 Visualization of the second ProGolem rule instantiated with protein 1HIZ (covered by the rule). The hexose is the glucose molecule to the left, with a pink backbone. To the right, with a white backbone, are the amino acids asn and glu, in closer contact with the hexose. The dotted black lines highlight the distances between the atoms in the amino acids and the center of the hexose.