J Huan, W Wang, A Washington, J Prins, R Shah, A Tropsha.
Abstract
Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs in which the nodes are residues and the edges connect residues found within a certain distance of each other. Applying subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs may be significant in the context of protein classification. To address these challenges, we have developed an information-theoretic model called coherent subgraph mining. In information theory, the entropy of a random variable X measures the information content carried by X, and the Mutual Information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that reports only coherent subgraphs. To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al., 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families.
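As an illustration only, the two building blocks named in the abstract (a distance-based residue graph and the MI between two patterns) might be sketched as follows. This is not the authors' implementation. The function names, the use of one 3D coordinate per residue, the cutoff value, and the modeling of each pattern as a binary indicator of its occurrence in each protein graph are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def contact_graph(coords, cutoff):
    """Return edges (i, j), i < j, between residues closer than `cutoff`.

    `coords` holds one (x, y, z) point per residue (assumed here, e.g.
    one representative atom per residue); `cutoff` is a hypothetical
    distance threshold in the same units as the coordinates.
    """
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges


def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length sequences.

    Here x[k] and y[k] are assumed to be 0/1 flags saying whether
    pattern X (resp. sub-pattern Y) occurs in the k-th protein graph.
    """
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


# Toy data: three residues on a line, with a hypothetical cutoff of 5.0.
toy_edges = contact_graph([(0, 0, 0), (3, 0, 0), (10, 0, 0)], cutoff=5.0)

# Occurrence flags over four hypothetical protein graphs.
mi_same = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
mi_indep = mutual_information([1, 1, 0, 0], [1, 0, 1, 0])
```

In this toy setting, a pattern would be a candidate for coherence when its MI with each sufficiently large sub-pattern exceeds some threshold. The threshold and the minimum sub-pattern size are parameters that the abstract does not specify.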
Year: 2004 PMID: 14992521 DOI: 10.1142/9789812704856_0039
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928