Gregory M Cipriano, George N Phillips, Michael Gleicher.
Abstract
BACKGROUND: Molecular recognition in proteins occurs due to appropriate arrangements of physical, chemical, and geometric properties of an atomic surface. Similar surface regions should create similar binding interfaces. Effective methods for comparing surface regions can be used to identify similar regions and to predict interactions without regard to the underlying structural scaffold that creates the surface.
Year: 2012 PMID: 23176080 PMCID: PMC3585919 DOI: 10.1186/1471-2105-13-314
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Shown are the results for three of the calcium ion binding predictions discussed in the results section. The three examples depict a successful result, a moderate success, and a failure case. In red are areas that the classifier chose as highly likely (>95% estimated probability) to bind to calcium. In lighter orange are areas that have between 40% and 95% probability of binding, with the shade of orange indicating approximately where in that range the estimate fell. In white are areas that were deemed unlikely to bind to calcium. The binding locations of the crystal structures are shown as blue spheres with a green point at the center.
Figure 2. This image shows representative examples from the multiple-ligand binding test discussed in the results section. The best and worst examples for each test ligand in the experiment are shown. Binding prediction probability is shown by color on the protein surface: red marks areas that the classifier chose as highly likely (>95% estimated probability), and orange marks areas with between 40% and 95% probability of binding.
Figure 3. Shown here are, for one sample point, the disc-shaped patches of each radius used in the functional surface descriptor: 1.6 Å, 3.2 Å, 4.8 Å, 6.4 Å and 8 Å.
Table 1. A list of each feature contained within our surface descriptor
| # | Feature | Description |
|---|---|---|
| 1 | % Visibility | Percentage of outside world visible from point |
| 2 | Non-Polar Backbone | Distance to the nearest Non-Polar Backbone Atom |
| 3 | Arom. Sidechain | Distance to the nearest Aromatic Sidechain Atom |
| 4 | Aliph. Sidechain | Distance to the nearest Aliphatic Sidechain Atom |
| 5 | N Backbone | Distance to the nearest Nitrogen Backbone Atom |
| 6 | O Backbone | Distance to the nearest Oxygen Backbone Atom |
| 7 | S Sidechain | Distance to the nearest Sulfur Sidechain Atom |
| 8 | Amide N Sidechain | Distance to the nearest Amide Nitrogen Sidechain Atom |
| 9 | Amide O Sidechain | Distance to the nearest Amide Oxygen Sidechain Atom |
| 10 | Trp Sidechain | Distance to the nearest Tryptophan Sidechain Atom |
| 11 | Hydroxyl Sidechain | Distance to the nearest Hydroxyl Sidechain Atom |
| 12 | Charged O Sidechain | Distance to the nearest Charged Oxygen Sidechain Atom |
| 13 | Charged N Sidechain | Distance to the nearest Charged Nitrogen Sidechain Atom |
| 14 | Anisotropy (1.6 Å) | Patch anisotropy, with radius: 1.6 Å |
| 15 | Anisotropy (3.2 Å) | Patch anisotropy, with radius: 3.2 Å |
| 16 | Anisotropy (4.8 Å) | Patch anisotropy, with radius: 4.8 Å |
| 17 | Anisotropy (6.4 Å) | Patch anisotropy, with radius: 6.4 Å |
| 18 | Anisotropy (8 Å) | Patch anisotropy, with radius: 8 Å |
| 19 | Curvature (1.6 Å) | Patch curvature, with radius: 1.6 Å |
| 20 | Curvature (3.2 Å) | Patch curvature, with radius: 3.2 Å |
| 21 | Curvature (4.8 Å) | Patch curvature, with radius: 4.8 Å |
| 22 | Curvature (6.4 Å) | Patch curvature, with radius: 6.4 Å |
| 23 | Curvature (8 Å) | Patch curvature, with radius: 8 Å |
| 24 | Curvature Var. (1.6 Å) | Variance of curvature within patch of radius: 1.6 Å |
| 25 | Curvature Var. (3.2 Å) | Variance of curvature within patch of radius: 3.2 Å |
| 26 | Curvature Var. (4.8 Å) | Variance of curvature within patch of radius: 4.8 Å |
| 27 | Curvature Var. (6.4 Å) | Variance of curvature within patch of radius: 6.4 Å |
| 28 | Curvature Var. (8 Å) | Variance of curvature within patch of radius: 8 Å |
| 29 | Hydropathy (1.6 Å) | Weighted avg. hydropathy over patch of radius: 1.6 Å |
| 30 | Hydropathy (3.2 Å) | Weighted avg. hydropathy over patch of radius: 3.2 Å |
| 31 | Hydropathy (4.8 Å) | Weighted avg. hydropathy over patch of radius: 4.8 Å |
| 32 | Hydropathy (6.4 Å) | Weighted avg. hydropathy over patch of radius: 6.4 Å |
| 33 | Hydropathy (8 Å) | Weighted avg. hydropathy over patch of radius: 8 Å |
| 34 | Charge (1.6 Å) | Weighted avg. charge over patch of radius: 1.6 Å |
| 35 | Charge (3.2 Å) | Weighted avg. charge over patch of radius: 3.2 Å |
| 36 | Charge (4.8 Å) | Weighted avg. charge over patch of radius: 4.8 Å |
| 37 | Charge (6.4 Å) | Weighted avg. charge over patch of radius: 6.4 Å |
| 38 | Charge (8 Å) | Weighted avg. charge over patch of radius: 8 Å |
| 39 | Hyd. Bond Donor | Distance to nearest potential external hydrogen bond donor |
| 40 | Hyd. Bond Acceptor | Distance to nearest potential external hydrogen bond acceptor |
Note that features 14 - 38 are weighted according to distance (from center vertex) and point area (i.e. the sum of the areas of all adjacent triangles).
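The weighting described in this note can be sketched as follows. The source does not give the exact weighting function, so the linear distance falloff used here is an assumption for illustration only, as are the function and parameter names.

```python
def patch_weighted_average(values, distances, areas, radius):
    """Average per-vertex values over a disc-shaped patch.

    Each vertex is weighted by its point area times a distance falloff.
    ASSUMPTION: the linear falloff (1 - d / radius) is illustrative; the
    source only says weights depend on distance and point area.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for value, dist, area in zip(values, distances, areas):
        if dist > radius:
            continue  # vertex lies outside the patch
        weight = area * (1.0 - dist / radius)
        total_weight += weight
        weighted_sum += weight * value
    if total_weight == 0.0:
        return 0.0  # empty patch, or only vertices exactly on the rim
    return weighted_sum / total_weight
```

Calling the function once per radius (1.6, 3.2, 4.8, 6.4 and 8 Å) would give the five values of one patch feature such as hydropathy or charge.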
Figure 4. A visual depiction of Algorithm 2, for training a classifier to recognize the environment surrounding a specific atom, given a corpus of examples of that atom’s binding.
Figure 5. Shown here are two possible conformations for Adenosine Triphosphate (ATP). Note that the distances between atoms within the rigid adenine moiety do not change. Distances between non-rigid components, such as those between the ‘C8’ and ‘O3A’ atoms, may change dramatically. As these will be used later to combine atomic predictions, the observed minimum and maximum distances between each pair of atoms are stored during the training phase.
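As a minimal sketch of the bookkeeping this caption describes, the function below records the smallest and largest distance seen for each atom pair across a set of conformations. The data layout (a dict from atom name to coordinates) and the function name are assumptions, and the coordinates in the usage example are toy values, not real ATP geometry.

```python
import math
from itertools import combinations

def distance_ranges(conformations):
    """Return {(atom_a, atom_b): (min_dist, max_dist)} over all conformations.

    Each conformation maps an atom name to its (x, y, z) coordinates.
    This layout is hypothetical, chosen only for illustration.
    """
    ranges = {}
    for coords in conformations:
        # Sort names so each pair is always keyed in the same order.
        for a, b in combinations(sorted(coords), 2):
            d = math.dist(coords[a], coords[b])
            lo, hi = ranges.get((a, b), (d, d))
            ranges[(a, b)] = (min(lo, d), max(hi, d))
    return ranges

# Toy coordinates: C8 and N9 stay rigid, O3A moves between conformations.
conformations = [
    {"C8": (0, 0, 0), "N9": (1, 0, 0), "O3A": (4, 0, 0)},
    {"C8": (0, 0, 0), "N9": (0, 1, 0), "O3A": (0, 7, 0)},
]
ranges = distance_ranges(conformations)
```

In this toy input the rigid pair has equal minimum and maximum, while the pairs involving the flexible atom span a range.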
Figure 6. Prediction phase: combining atom surface functions to predict a ligand.
Figure 7. An illustration (in 2D) of how our method for grouping samples on the 3D surface works. In this illustration, each circle represents a sample; samples having similar values in feature space are given the same color. The algorithm proceeds as follows: starting with a radius R, identify discs of radius R that have minimum average distance (in feature space) between elements in the disc. Replace the best non-overlapping discs with the sample in the center of each disc. Repeat, each time reducing the size of the disc. When complete, there will still be samples not contained in a disc. Merge those into neighboring discs if their distance from the center sample is less than a threshold T. The resulting center samples are used for surface prediction. In all results, R = 4 Å and T = 0.25.
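The grouping procedure in this caption could be sketched roughly as below. Several details are assumptions not stated in the source: straight-line rather than surface distance, the shrink factor and number of rounds, accepting a disc only when its mean pairwise feature distance is at most T, and merging leftovers into the spatially nearest center. All names are hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise(features, members):
    """Mean feature-space distance over all pairs of members."""
    pairs = list(combinations(members, 2))
    return sum(math.dist(features[i], features[j]) for i, j in pairs) / len(pairs)

def group_samples(positions, features, radius, threshold, shrink=0.5, rounds=3):
    """Return {center_index: [member indices]} for a set of surface samples."""
    n = len(positions)
    center_of = {}  # sample index -> index of the disc center it belongs to
    r = radius
    for _ in range(rounds):
        candidates = []
        for c in range(n):
            if c in center_of:
                continue
            members = [i for i in range(n) if i not in center_of
                       and math.dist(positions[c], positions[i]) <= r]
            if len(members) >= 2:
                candidates.append((mean_pairwise(features, members), c, members))
        # Greedily accept the most homogeneous discs that do not overlap.
        for spread, c, members in sorted(candidates):
            if spread <= threshold and not any(m in center_of for m in members):
                for m in members:
                    center_of[m] = c
        r *= shrink  # ASSUMPTION: halve the disc radius each round
    centers = sorted(set(center_of.values()))
    # Leftovers join the nearest disc if similar enough, else stand alone.
    for i in range(n):
        if i in center_of:
            continue
        nearest = min(centers, default=None,
                      key=lambda c: math.dist(positions[c], positions[i]))
        if nearest is not None and math.dist(features[nearest], features[i]) < threshold:
            center_of[i] = nearest
        else:
            center_of[i] = i
    groups = {}
    for i, c in center_of.items():
        groups.setdefault(c, []).append(i)
    return {c: sorted(members) for c, members in groups.items()}
```

With the caption's settings this would be called as `group_samples(positions, features, radius=4.0, threshold=0.25)`, and the keys of the returned dict play the role of the center samples.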
Table 2. Results from tests of the atomic predictor on finding the binding location of calcium ions
| ROC Area | 0.92 | 0.57 | 1.00 | 0.94 | 0.95 | 0.97 | 0.96 | 0.97 | 0.96 | 0.98 | 0.97 |
Shown are the 11 proteins used as test cases by Altman et al. [45]. To test these, we first trained a predictor on 100 examples of calcium binding, then evaluated each of the above protein surfaces using this predictor to generate an ROC curve for each test. The number below each PDB code is the area under its respective ROC curve. A value of 1.0 indicates a perfect score, with all true positives found and no false positives. See Figure 1 for illustrations to help interpret these numbers.
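For readers unfamiliar with the metric, the area under an ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below computes it that way. It is a generic illustration, not the authors' evaluation code, and the function name and input layout are assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted binding probabilities, one per surface sample.
    labels: truthy where the sample is a true binding location.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

A predictor that ranks every true binding sample above every non-binding sample scores 1.0, while random scores give about 0.5.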
Figure 8. Shown here is the performance over all test cases of calcium binding (Table 2) as a function of the size of the training corpus. Performance is measured by the area under the ROC graph produced by each example. On the top, each test case is shown as a separate line, each with a different color. Note that while with only a few training examples, the correct pocket is found in most tests, a few harder test cases require more training before they can be reliably predicted. On the bottom is the same data, but averaged, with error bars indicating 95% confidence intervals.
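The averaged curve with error bars can be illustrated for a single training-set size using the 11 ROC areas reported for the calcium test cases. The source does not say how its confidence intervals were computed, so the normal-approximation interval below (z = 1.96) is an assumption, and the function name is hypothetical.

```python
import math
import statistics

def mean_with_ci(values, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% interval.

    ASSUMPTION: normal approximation, mean +/- z * standard error.
    """
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width

# ROC areas for the 11 calcium test cases, as listed in the results table.
roc_areas = [0.92, 0.57, 1.00, 0.94, 0.95, 0.97, 0.96, 0.97, 0.96, 0.98, 0.97]
mean, lower, upper = mean_with_ci(roc_areas)
print(f"mean ROC area = {mean:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With only 11 samples, a Student's t interval would be somewhat wider than this approximation.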
Table 3. Test ligands and their training sets
| Ligand | PDB codes |
|---|---|
| ATP | 1A0I 1A82 1ASZ 1E8X 1FMW 1G5T 1GN8 1JWA 1XEX 2BUP |
| GLC | 1AC0 3ACG 1FAE 1GWM 1H5V 1J0K 1JLX 1K9I 2J0Y 2ZX3 |
| DHT | 1AFS 1DHT 1I37 1I38 1KDK 2PIO 2PIP 3KLM 3L3X 3L3Z |
| HEM | 1AOQ 1B0B 1CC5 1D0C 1DLY 1EW0 1SOX 1MZ4 2HBG 2Z6F |
Table 4. Listed are the ligands to be used as test cases, the moieties they contain, the ligands found during training which match each moiety, and the total number of proteins used as training examples after post-process culling.
| Ligand | Moiety (atoms) | Matching ligands found during training | Training proteins |
|---|---|---|---|
| ATP | Phosphate Chain (PA O1A O2A O3A PB O1B O2B O3B PG O2G O1G O3G) | ATP, 5FA, CH1, CSG, CTP, D3T, DCT, DGT, DTP, GTP, TTP … | 36 |
| | Ribose (C1’ C2’ C3’ C4’ O2’ O3’ O4’) | ATP, 5GP, ACP, ADN, ADP, AMP, ANP, AP0, APC, ATG, C5P, FAD, GDP, GNP, GTP, NAD, NAP, NDP, RIB, SAH, SAM, SSA, UDP, ADP, … | 299 |
| | Adenine (C2 C4 C5 C6 C8 N1 N3 N6 N7 N9) | ATP, ACO, ACP, ADP, AMP, ANP, ATG, CMP, COA, FAD, NAD, NAP, NDP, SAH, SAM, … | 265 |
| Glucose | Glucose | GLC | 85 |
| HEM | oxygenated end (O1A O2A CGA CBA CAA C2A CMA C3A C1A CHA C4A NA CHB) | HEM, DHE, FDE, FDD, HAS, HCO, HDD, HEA, HEB, HEC, HEV, HFM, HIF, VEA, VER, … | 316 |
| | non-oxygenated end (1) (CHB C1B NB CMB C2B C4B CHC C3B CAB CBB) | HEM, HDD, HDM, HEA, HEB, HEC, HFM, HKL, … | 318 |
| | non-oxygenated end (2) (CHD C4C NC CAC C3C C1C CHC C2C CBC CMC) | HEM, HDD, HDM, HEA, HEB, HEC, HFM, HKL, … | 318 |
| DHT | (O3 C1-10 C19) | DHT, AE2, AND, C0R, CLR, CPQ, DXC, FFA, HC2, HCY, STR, TES … | 40 |
| | (O17 C8 C9 C11-18) | DHT, AE2, AND, ASD, EST, FFA, TES, WZA … | 22 |
Figure 9. Shown here are three confusion matrices: the top (a) tested using the full feature vector description (listed in Table 1), the middle (b) using only the most local features, and the bottom (c) using only geometric features. Each row represents the tests for a ligand classifier run on all test cases. Each column represents an individual testing example, grouped by the ligand the protein is known to bind to. The value in each cell is the area under the precision/recall curve produced from that test. A higher value indicates a better match. Green cells indicate true positive results: the predictor found the ligand it was trained for. Purple cells indicate false negatives: the predictor failed to find the ligand it was trained for. Red cells indicate false positives: the predictor found the site of a different ligand. See Figure 2 for illustrations to help interpret these numbers.
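One common way to estimate the area under a precision/recall curve is average precision, which averages the precision at the rank of each true positive. The sketch below uses that estimator as an illustration. The source does not state how its areas were computed, so treating average precision as the cell value is an assumption, as are the names used.

```python
def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision/recall curve.

    Samples are ranked by descending score; precision is accumulated at
    each rank where a true positive occurs, then averaged.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, y in ranked if y)
    if total_positives == 0:
        raise ValueError("need at least one positive sample")
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives
```

Unlike ROC area, this measure is sensitive to how rare the positive class is, which matters when binding sites cover only a small part of the surface.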
Figure 10. This image shows the charge of protein 1AYP as computed by APBS, indicated by color ranging from dark red (very negative) to dark blue (very positive). This charge pattern is quite different from any of the others seen in the training set.