Visual management of large scale data mining projects.

I Shah, L Hunter.

Abstract

This paper describes a unified framework for visualizing the preparations for, and results of, hundreds of machine learning experiments. These experiments were designed to improve the accuracy of enzyme functional predictions from sequence, and in many cases were successful. Our system provides graphical user interfaces for defining and exploring training datasets and various representational alternatives, for inspecting the hypotheses induced by various types of learning algorithms, for visualizing the global results, and for inspecting in detail results for specific training sets (functions) and examples (proteins). The visualization tools serve as a navigational aid through a large amount of sequence data and induced knowledge. They provided significant help in understanding both the significance and the underlying biological explanations of our successes and failures. Using these visualizations it was possible to efficiently identify weaknesses of the modular sequence representations and induction algorithms which suggest better learning strategies.

The context in which our data mining visualization toolkit was developed was the problem of accurately predicting enzyme function from protein sequence data. Previous work demonstrated that approximately 6% of enzyme protein sequences are likely to be assigned incorrect functions on the basis of sequence similarity alone. In order to test the hypothesis that more detailed sequence analysis using machine learning techniques and modular domain representations could address many of these failures, we designed a series of more than 250 experiments using information-theoretic decision tree induction and naive Bayesian learning on local sequence domain representations of problematic enzyme function classes. In more than half of these cases, our methods were able to perfectly discriminate among various possible functions of similar sequences. We developed and tested our visualization techniques on this application.
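
The abstract mentions information-theoretic decision tree induction over modular domain representations. As a rough illustration only, and not the authors' implementation, the sketch below computes the information gain of splitting a set of proteins on the presence or absence of one domain. It assumes each protein is represented as a set of domain identifiers; the function names, domain names, and function labels are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(examples, labels, domain):
    """Entropy reduction from splitting on presence/absence of `domain`.

    `examples` is a list of sets of domain identifiers (an assumed,
    simplified stand-in for a modular sequence representation).
    """
    with_domain = [y for x, y in zip(examples, labels) if domain in x]
    without_domain = [y for x, y in zip(examples, labels) if domain not in x]
    remainder = 0.0
    for subset in (with_domain, without_domain):
        if subset:  # skip empty branches to avoid dividing by zero
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: four proteins, two candidate function classes.
proteins = [{"dA", "dB"}, {"dA"}, {"dB", "dC"}, {"dC"}]
functions = ["EC_x", "EC_x", "EC_y", "EC_y"]

gain_dA = information_gain(proteins, functions, "dA")
gain_dB = information_gain(proteins, functions, "dB")
```

In this toy data the presence of `dA` separates the two classes completely, while `dB` occurs equally in both and carries no information. A decision tree learner of this kind would therefore split on `dA` first.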

Year:  2000        PMID: 10902176      PMCID: PMC2709531          DOI: 10.1142/9789814447331_0026

Source DB:  PubMed          Journal:  Pac Symp Biocomput        ISSN: 2335-6928


