Zainab Abu Deeb, Donald A Adjeroh, Bing-Hua Jiang.
Abstract
Aim. To develop a new invariant descriptor for characterizing protein surfaces, suitable for analysis tasks such as protein functional classification and search and retrieval of protein surfaces over a large database. Methods. We start with a local descriptor of selected circular patches on the protein surface. The descriptor records the distribution of distances between the central residue and the other residues within the patch, keeping track of how often each pairwise residue co-occurrence appears in the patch. A global descriptor for the entire protein surface is then constructed by combining information from the local descriptors. Our method is novel in its focus on residue-specific distance distributions and in its use of residue-distance co-occurrences as the basis for the proposed protein surface descriptors. Results. Results are presented for protein classification and retrieval on three protein families. For the three families, we obtained an area under the precision-recall curve ranging from 0.6494 (without residue co-occurrences) to 0.6683 (with residue co-occurrences). Large-scale screening using two other protein families placed related family members at the top of the ranking, with a number of uncharacterized proteins also retrieved. Comparative results with other proposed methods are included.
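
The local and global descriptors outlined under Methods can be illustrated with a minimal sketch. It assumes each surface residue is given as a (residue type, 3D coordinate) pair, that a patch is every residue within a fixed radius of the central residue, and that the global descriptor is the normalized sum of the local histograms. The function names, the radius, the number of distance bins, and the combining rule are hypothetical choices for illustration, not the authors' published parameters.

```python
import math
from collections import Counter

def local_descriptor(residues, center_idx, radius=10.0, n_bins=5):
    """Count (residue pair, distance bin) co-occurrences in one patch.

    residues: list of (residue_type, (x, y, z)) tuples (assumed input format).
    radius and n_bins are illustrative values, not taken from the paper.
    """
    c_type, c_xyz = residues[center_idx]
    counts = Counter()
    for i, (r_type, r_xyz) in enumerate(residues):
        if i == center_idx:
            continue  # the central residue is not paired with itself
        d = math.dist(c_xyz, r_xyz)
        if d > radius:
            continue  # residue lies outside the circular patch
        # Map the distance to a bin; a distance equal to radius goes in the last bin.
        b = min(int(d * n_bins / radius), n_bins - 1)
        pair = tuple(sorted((c_type, r_type)))  # unordered residue pair
        counts[(pair, b)] += 1
    return counts

def global_descriptor(residues, center_indices, radius=10.0, n_bins=5):
    """Combine local descriptors by summing and normalizing (an assumed rule)."""
    total = Counter()
    for idx in center_indices:
        total.update(local_descriptor(residues, idx, radius, n_bins))
    n = sum(total.values())
    return {key: value / n for key, value in total.items()} if n else {}
```

Because the sketch uses only inter-residue distances, translating or rotating the whole structure leaves the result unchanged, which is the sense in which such a descriptor is invariant.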
Year: 2011 PMID: 22144981 PMCID: PMC3227456 DOI: 10.1155/2011/918978
Source DB: PubMed Journal: Int J Biomed Imaging ISSN: 1687-4188
Figure 1. Protein structures for a sample protein (PDB id: 2UDI). (a) Secondary structure elements: α-helices (magenta), β-sheets (gold), and turns (gray); (b) two chains: chain E (blue), chain I (green); (c) surface and 3D shape for chain E; (d) surface and 3D shape for chain I; (e) quaternary structure for the protein. Figures are produced using PMV [16].
Figure 2. Schematic diagram for the protein surface characterization using an invariant descriptor. Protein structures in the figure are produced using PMV [3].
Figure 3. Variation of classification rate (CR) with size of training set using the proposed descriptors DD2 (a), RC (b), and DRC (c). Results are shown for the average over 10 runs, using logistic regression as the classifier.
Figure 4. Variation of classification rate (CR) with size of testing set using the proposed descriptors DD2 (a), RC (b), and DRC (c). Results are shown for the average over 10 runs, using logistic regression as the classifier.
Figure 5. Summary classification performance using n-fold cross validation (the x-axis is for varying n).
Figure 6. Ranking and retrieval performance for the proposed methods. (a) Enrichment plot for screening protein structures using the proposed descriptors. Results are averaged over 5 query proteins from the cell division protein kinase 2 family (Group 3), using DATASET-A (416 protein chains). (b) Average precision and recall for three queries, one for each group in DATASET-A. DD1 corresponds to the distance distribution proposed in [11], as described in Section 2 (see Section 4.5 on comparison with related methods).
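
To make the precision and recall measure behind Figure 6(b) concrete, the sketch below computes a precision-recall curve from a ranked hit list and integrates it. The source does not state how the area under the curve was computed, so the trapezoidal rule and the starting point (recall 0, precision 1) are assumptions, and the function names are hypothetical.

```python
def precision_recall_points(ranked_relevance):
    """Return (recall, precision) after each rank position.

    ranked_relevance: list of booleans, best-ranked hit first;
    True marks a hit from the same family as the query.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("at least one relevant item is required")
    points = []
    hits = 0
    for k, relevant in enumerate(ranked_relevance, start=1):
        hits += int(relevant)
        points.append((hits / total_relevant, hits / k))
    return points

def area_under_pr_curve(points):
    """Trapezoidal area under precision as a function of recall (assumed rule)."""
    area = 0.0
    prev_recall, prev_precision = 0.0, 1.0  # assumed starting convention
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area
```

A ranking that places every family member ahead of all other chains yields an area of 1.0 under this convention; relevant chains that appear lower in the list pull the value down.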
(a) DRC on query protein 1UDI chain I (Group 1)
| Protein PDB ID | Chain | Rank | Distance |
|---|---|---|---|
| 1UDI | I | 1 | 0 |
| 2ZHX | B | 2 | 2.1306 |
| 1LQM | B | 3 | 2.3509 |
| 1LQG | C | 4 | 2.4589 |
| 2UUG | C | 5 | 2.5353 |
| 1EUI | C | 7 | 2.5920 |
| 2ZHX | L | 8 | 2.6104 |
| 1UGH | I | 10 | 2.6349 |
| 2UGI | A | 15 | 2.6809 |
| 1UGI | E | 16 | 2.6872 |
| 2ZHX | H | 19 | 2.6969 |
| 2ZHX | D | 21 | 2.7006 |
| 2ZHX | N | 22 | 2.7017 |
| 2ZHX | J | 23 | 2.7141 |
| 1UGI | G | 25 | 2.7261 |
| 1EMJ | A | 42 | 2.7758 |
| 2BOO | A | 45 | 2.7808 |
| 1UGI | D | 47 | 2.7868 |
| 2OWR | B | 50 | 2.7952 |
| 2J8X | D | 61 | 2.8129 |
| 1LQG | D | 70 | 2.8263 |
| 1Q3F | A | 90 | 2.8533 |
| 2UUG | D | 99 | 2.8675 |
| 1UGI | A | 101 | 2.8689 |
| 2OWR | C | 110 | 2.8749 |
| 2ZHX | F | 116 | 2.885 |
| 2OWQ | B | 129 | 2.8977 |
| 1SSP | E | 141 | 2.9115 |
| 1UGI | C | 142 | 2.9117 |
| 2ZHX | A | 147 | 2.915 |
(b) DRC on query protein 1QKN chain A (Group 2)
| Protein PDB ID | Chain | Rank | Distance |
|---|---|---|---|
| 1QKN | A | 1 | 0 |
| 2J7X | A | 2 | 1.3769 |
| 2J7Y | A | 3 | 1.5753 |
| 1QKM | A | 4 | 1.6793 |
| 1NDE | A | 5 | 1.7368 |
| 2GIU | A | 6 | 1.7371 |
| 1L2I | A | 7 | 1.7460 |
| 1U3R | B | 8 | 1.7670 |
| 3ERD | A | 9 | 1.7683 |
| 3OS9 | A | 10 | 1.7715 |
| 2IOG | A | 11 | 1.7738 |
| 3LTX | C | 12 | 1.7742 |
| 1YIM | A | 13 | 1.7854 |
| 3ERT | A | 14 | 1.7966 |
| 1U3Q | D | 15 | 1.8009 |
| 1YY4 | A | 16 | 1.8090 |
| 2OUZ | A | 17 | 1.8126 |
| 1YIN | A | 18 | 1.8152 |
| 1XP6 | A | 19 | 1.8260 |
| 2AYR | A | 20 | 1.8269 |
| 3OS8 | D | 21 | 1.8311 |
| 2QH6 | A | 22 | 1.8312 |
| 3OSA | A | 23 | 1.8367 |
| 1L2J | A | 24 | 1.8385 |
| 2JJ3 | A | 25 | 1.8438 |
| 1G50 | A | 26 | 1.8490 |
| 3OS8 | A | 27 | 1.8509 |
| 2FSZ | A | 28 | 1.8518 |
| 2QGW | A | 29 | 1.8604 |
| 1UOM | A | 30 | 1.8683 |
(c) DRC on query protein 1YKR chain A (Group 3)
| Protein PDB ID | Chain | Rank | Distance |
|---|---|---|---|
| 1YKR | A | 1 | 0 |
| 2UZO | A | 2 | 1.0396 |
| 2R3O | A | 3 | 1.0810 |
| 3PXY | A | 4 | 1.0846 |
| 3PY1 | A | 5 | 1.0940 |
| 2WMA | A | 6 | 1.1389 |
| 2IW6 | A | 7 | 1.1609 |
| 3NS9 | A | 8 | 1.2043 |
| 3IGG | A | 9 | 1.2141 |
| 2WFY | A | 10 | 1.2179 |
| 2C5Y | A | 11 | 1.2270 |
| 3DDP | A | 12 | 1.2280 |
| 2J9M | A | 13 | 1.2284 |
| 2R3J | A | 14 | 1.2374 |
| 2R3L | A | 15 | 1.2402 |
| 3PXR | A | 16 | 1.2422 |
| 2DUV | A | 17 | 1.2534 |
| 1W8C | A | 18 | 1.2586 |
| 3DOG | A | 19 | 1.2793 |
| 2V22 | A | 20 | 1.2822 |
| 2R3P | A | 21 | 1.2883 |
| 2V22 | C | 22 | 1.2963 |
| 3IG7 | A | 23 | 1.3207 |
| 2JGZ | A | 24 | 1.3275 |
| 2R64 | A | 25 | 1.3303 |
| 2WHB | A | 26 | 1.3381 |
| 2VTN | A | 27 | 1.3458 |
| 3LFN | A | 28 | 1.3476 |
| 2WIP | A | 29 | 1.3514 |
| 2BKZ | A | 30 | 1.3580 |
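
The rank and distance columns in tables (a) to (c) can be read as the output of a nearest-neighbor search over descriptors. The sketch below shows one way such a ranking could be produced. It assumes each chain is represented by a fixed-length descriptor vector and uses Euclidean distance. The source excerpt does not name the distance measure, so that choice, the function name, and the toy identifiers in the usage note are all hypothetical.

```python
import math

def rank_by_distance(query_vector, database):
    """Rank database chains by distance to the query descriptor.

    database: dict mapping (pdb_id, chain) -> descriptor vector.
    Euclidean distance is an assumption made for this illustration.
    Returns a list of (rank, (pdb_id, chain), distance), closest first.
    """
    scored = sorted(
        (math.dist(query_vector, vector), key)
        for key, vector in database.items()
    )
    return [
        (rank, key, distance)
        for rank, (distance, key) in enumerate(scored, start=1)
    ]
```

With a toy database such as `{("QRY", "A"): [0.0, 0.0], ("AAA", "A"): [3.0, 4.0], ("BBB", "B"): [1.0, 0.0]}` and the query vector `[0.0, 0.0]`, the query chain itself comes back at rank 1 with distance 0, which matches the first row of each table above.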
(a) Top 50 hits using DRC for a query protein structure from the EGF family on DATASET-B. Annotations in bold correspond to members of the EGF family, predicted proteins, or uncharacterized proteins
| Protein | Chain | Distance | Protein name annotation | Rank |
|---|---|---|---|---|
| 2a2q | L | 0.0000 | | 1 |
| 2fir | L | 1.4925 | | 2 |
| 2zp0 | L | 1.5545 | | 3 |
| 1wtg | L | 1.5628 | | 4 |
| 1wun | L | 1.5844 | | 5 |
| 2b8o | L | 1.6305 | | 6 |
| 2zwl | L | 1.6379 | | 7 |
| 2zzu | L | 1.6536 | | 8 |
| 1wqv | L | 1.6816 | | 9 |
| 2ec9 | L | 1.6832 | | 10 |
| 1dan | L | 1.7655 | | 11 |
| 1wss | L | 1.7659 | | 12 |
| 2puq | L | 1.7692 | | 13 |
| 1fak | L | 1.7934 | | 14 |
| 2b7d | L | 1.8024 | | 15 |
| 6acn | A | 1.8061 | ACONITASE | 16 |
| 2aer | L | 1.8120 | | 17 |
| 2aei | L | 1.8164 | | 18 |
| 2flr | L | 1.8196 | | 19 |
| 3ela | L | 1.8668 | | 20 |
| 1z6j | L | 1.8859 | | 21 |
| 2f9b | L | 1.9027 | | 22 |
| 3phs | A | 1.9169 | CELL WALL SURFACE ANCHOR FAMILY PROTEIN | 23 |
| 3n54 | B | 1.9263 | SPORE GERMINATION PROTEIN B3 | 24 |
| 3qbp | B | 1.9264 | FUMARASE FUM | 25 |
| 3ma9 | L | 1.9367 | TRANSMEMBRANE GLYCOPROTEIN | 26 |
| 3mt0 | A | 1.9921 | | 27 |
| 3lgu | A | 2.0004 | PROTEASE DEGS | 28 |
| 3m7i | A | 2.0071 | TRANSKETOLASE | 29 |
| 1qfk | L | 2.0096 | | 30 |
| 3n9t | A | 2.0169 | PNPC | 31 |
| 3pxz | A | 2.0235 | CELL DIVISION PROTEIN KINASE 2 | 32 |
| 2flb | L | 2.0257 | | 33 |
| 3lh1 | A | 2.0289 | PROTEASE DEGS | 34 |
| 3nlc | A | 2.0300 | | 35 |
| 3no5 | C | 2.0302 | | 36 |
| 3msq | C | 2.0347 | PUTATIVE UBIQUINONE BIOSYNTHESIS PROTEIN | 37 |
| 3ryk | A | 2.0389 | DTDP-4-DEHYDRORHAMNOSE 3,5-EPIMERASE | 38 |
| 3m4a | A | 2.0444 | PUTATIVE TYPE I TOPOISOMERASE | 39 |
| 3n3n | B | 2.0444 | CATALASE-PEROXIDASE | 40 |
| 2R3G | A | 2.0456 | CELL DIVISION PROTEIN KINASE 2 | 41 |
| 2R3I | A | 2.0465 | CELL DIVISION PROTEIN KINASE 2 | 42 |
| 3o0r | L | 2.0473 | ANTIBODY FAB FRAGMENT LIGHT CHAIN | 43 |
| 3n3p | B | 2.0490 | CATALASE-PEROXIDASE | 44 |
| 3nfh | A | 2.0492 | DNA-DIRECTED RNA POLYMERASE I SUBUNIT RPA49 | 45 |
| 3qfk | A | 2.0501 | | 46 |
| 3n5h | F | 2.0517 | FARNESYL PYROPHOSPHATE SYNTHASE | 47 |
| 3n3o | B | 2.0529 | CATALASE-PEROXIDASE | 48 |
| 3o78 | B | 2.0548 | CHIMERA PROTEIN OF PEPTIDE OF MYOSIN LIGHT CHAIN SMOOTH MUSCLE, GREEN FLUORESCENT PROTEIN, GREEN FLUORESCENT CALMODULIN | 49 |
| 3luy | A | 2.0570 | PROBABLE CHORISMATE MUTASE | 50 |
(b) Top 50 hits using DRC for a query protein structure from the COX-2 family on DATASET-B. Annotations in bold correspond to members of the COX-2 family, predicted proteins, or uncharacterized proteins
| Protein | Chain | Distance | Protein name annotation | Rank |
|---|---|---|---|---|
| 2zxw | B | 0.0000 | | 1 |
| 2eil | B | 1.3914 | | 2 |
| 2eij | B | 1.5853 | | 3 |
| 3ag4 | B | 1.6147 | | 4 |
| 2dys | B | 1.6991 | | 5 |
| 3ag1 | B | 1.8159 | | 6 |
| 2eim | B | 1.8824 | | 7 |
| 3ag2 | B | 2.0173 | | 8 |
| 2occ | B | 2.0631 | | 9 |
| 3abl | B | 2.1404 | | 10 |
| 2eik | B | 2.1413 | | 11 |
| 3abm | B | 2.1454 | | 12 |
| 1v55 | B | 2.1489 | | 13 |
| 2dyr | B | 2.2044 | | 14 |
| 2ein | B | 2.2628 | | 15 |
| 1v54 | B | 2.2976 | | 16 |
| 3n56 | B | 2.4226 | INSULIN-DEGRADING ENZYME | 17 |
| 3abk | B | 2.4502 | | 18 |
| 3p42 | A | 2.4785 | | 19 |
| 3r2u | B | 2.4846 | METALLO-BETA-LACTAMASE FAMILY PROTEIN | 20 |
| 3msu | B | 2.4920 | CITRATE SYNTHASE | 21 |
| 3ag3 | B | 2.4950 | | 22 |
| 3ntd | B | 2.5052 | FAD-DEPENDENT PYRIDINE NUCLEOTIDE-DISULPHIDE OXIDOREDUCTASE | 23 |
| 3ngi | A | 2.5055 | DNA POLYMERASE | 24 |
| 7xim | B | 2.5198 | D-XYLOSE ISOMERASE | 25 |
| 3mjy | A | 2.5206 | DIHYDROOROTATE DEHYDROGENASE | 26 |
| 3nva | B | 2.5391 | CTP SYNTHASE | 27 |
| 3lm3 | A | 2.5433 | | 28 |
| 3ppn | B | 2.5511 | GLYCINE BETAINE/CARNITINE/CHOLINE-BINDING PROTEIN | 29 |
| 3o98 | B | 2.5557 | BIFUNCTIONAL GLUTATHIONYLSPERMIDINE SYNTHETASE/AM | 30 |
| 3pom | B | 2.5558 | RETINOBLASTOMA-ASSOCIATED PROTEIN | 31 |
| 3nt6 | B | 2.5656 | FAD-DEPENDENT PYRIDINE NUCLEOTIDE-DISULPHIDE OXIDOREDUCTASE | 32 |
| 5lym | B | 2.5690 | LYSOZYME | 33 |
| 3n1y | B | 2.5728 | TOLUENE O-XYLENE MONOOXYGENASE COMPONENT | 34 |
| 1occ | B | 2.5772 | | 35 |
| 3lxt | D | 2.5860 | GLUTATHIONE S TRANSFERASE | 36 |
| 2q70 | B | 2.5922 | ESTROGEN RECEPTOR | 37 |
| 3l49 | B | 2.5930 | ABC SUGAR (RIBOSE) TRANSPORTER, PERIPLASMIC SUBSTRATE-BINDING SUBUNIT | 38 |
| 3pvq | A | 2.5944 | DIPEPTIDYL-PEPTIDASE VI | 39 |
| 3puf | B | 2.6032 | RIBONUCLEASE H2 SUBUNIT B | 40 |
| 3mve | B | 2.6077 | UPF0255 PROTEIN VV1_0328 | 41 |
| 3ld2 | B | 2.6143 | PUTATIVE ACETYLTRANSFERASE | 42 |
| 3ne6 | A | 2.6160 | DNA POLYMERASE | 43 |
| 3qae | A | 2.6179 | 3-HYDROXY-3-METHYLGLUTARYL-COENZYME A REDUCTASE | 44 |
| 3qh8 | A | 2.6197 | BETA-LACTAMASE-LIKE | 45 |
| 3m3r | A | 2.6215 | ALPHA-HEMOLYSIN | 46 |
| 3nrb | B | 2.6234 | FORMYLTETRAHYDROFOLATE DEFORMYLASE | 47 |
| 3n05 | B | 2.6240 | NH(3)-DEPENDENT NAD(+) SYNTHETASE | 48 |
| 3m2l | A | 2.6324 | ALPHA-HEMOLYSIN | 49 |
| 3pns | B | 2.6357 | URIDINE PHOSPHORYLASE | 50 |
(a) Overall classification rate using different classifiers (300 training samples, 100 testing samples from DATASET-A)
| Classifier (rows) / Descriptor (columns) | DD1 | DD2 | RC | DRC |
|---|---|---|---|---|
| Naïve bayes | 58% | 86% | 94% | 91% |
| Logistic | 58% | 85% | 99% | 97% |
| Simple logistic | 58% | 89% | 98% | 91% |
(b) Overall classification rate using different classifiers (100 training samples, 300 testing samples from DATASET-A)
| Classifier (rows) / Descriptor (columns) | DD1 | DD2 | RC | DRC |
|---|---|---|---|---|
| Naïve bayes | 55% | 74% | 94% | 93% |
| Logistic | 62% | 88% | 89% | 94% |
| Simple logistic | 63% | 85% | 91% | 90% |